{
 "cells": [
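  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Week 1 Assignment\n",
    "## Written Assignment\n",
    "\n",
    "1. Research the basic usage of the hyperopt library on your own and write a small example.\n",
    "2. Quora (similar to Zhihu, but outside China) receives thousands of questions every day on all kinds of topics, so many of them are bound to be duplicates or near-duplicates. How can we determine whether two questions ask the same thing?  \n",
    "Background: https://www.kaggle.com/c/quora-question-pairs  \n",
    "(1) Perform basic data exploration and describe the basic characteristics of the data.  \n",
    "(2) Try to extract a set of effective features that help solve the problem.  \n",
    "The features may be described in prose or in code (pseudocode or runnable Python). Extract more than 10 features; do not copy this week's feature extraction approach verbatim, although you may draw on it.\n",
    "3. How would you solve the problem above? Briefly describe your approach.\n",
    "\n",
    "## 1 Research the basic usage of the hyperopt library and write a small example\n",
    "Example code:"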
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn import metrics\n",
    "from sklearn.model_selection import cross_val_score\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from hyperopt import tpe, hp, fmin, space_eval, STATUS_OK, Trials\n",
    "from hyperopt.pyll.base import scope\n",
    "\n",
    "# Dataset: https://www.kaggle.com/iabhishekofficial/mobile-price-classification?select=train.csv\n",
    "data = pd.read_csv(\"train.csv\")\n",
    "\n",
    "# Split the data into features and target\n",
    "X = data.drop(\"price_range\", axis=1).values\n",
    "y = data.price_range.values\n",
    "\n",
    "# Standardize the feature variables\n",
    "scaler = StandardScaler()\n",
    "X_scaled = scaler.fit_transform(X)\n",
    "\n",
    "# Define the search space for the optimization\n",
    "space = {\n",
    "    \"n_estimators\": hp.choice(\"n_estimators\", [100, 200, 300, 400, 500, 600]),\n",
    "    # hp.quniform yields floats; scope.int casts to the int that max_depth expects\n",
    "    \"max_depth\": scope.int(hp.quniform(\"max_depth\", 1, 15, 1)),\n",
    "    \"criterion\": hp.choice(\"criterion\", [\"gini\", \"entropy\"]),\n",
    "}\n",
    "\n",
    "# Define the objective function; fmin minimizes, so return the negative accuracy\n",
    "def hyperparameter_tuning(params):\n",
    "    clf = RandomForestClassifier(**params, n_jobs=-1)\n",
    "    acc = cross_val_score(clf, X_scaled, y, scoring=\"accuracy\").mean()\n",
    "    return {\"loss\": -acc, \"status\": STATUS_OK}\n",
    "\n",
    "# Initialize the Trials object that records every evaluation\n",
    "trials = Trials()\n",
    "\n",
    "best = fmin(\n",
    "    fn=hyperparameter_tuning,\n",
    "    space=space,\n",
    "    algo=tpe.suggest,\n",
    "    max_evals=100,\n",
    "    trials=trials\n",
    ")\n",
    "\n",
    "# fmin reports hp.choice parameters as list indices, not values\n",
    "print(\"Best: {}\".format(best))\n",
    "# space_eval maps the result back to the actual hyperparameter values\n",
    "print(\"Best (decoded): {}\".format(space_eval(space, best)))"
   ]
  },
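  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "100%|██████████████████████████████████████████████████████████| 100/100 [08:47<00:00,  5.27s/trial, best loss: -0.892]  \n",
    "Best: {'criterion': 1, 'max_depth': 14.0, 'n_estimators': 5}  \n",
    "\n",
    "Note: for `hp.choice` parameters, `fmin` returns the index of the chosen option rather than its value, so `criterion: 1` means `entropy` and `n_estimators: 5` means 600. The best loss of -0.892 corresponds to a cross-validated accuracy of about 0.892."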
   ]
  },
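  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2 Quora (similar to Zhihu, but outside China) receives thousands of questions every day on all kinds of topics, so many of them are bound to be duplicates or near-duplicates. How can we determine whether two questions ask the same thing?\n",
    "(1) Perform basic data exploration and describe the basic characteristics of the data.  \n",
    "This is covered in the **Data Exploration** section (Section 4) later in this Jupyter Notebook.  \n",
    "(2) Try to extract a set of effective features that help solve the problem.  \n",
    "This is covered in the **Feature Engineering** section (Section 5) later in this Jupyter Notebook.  \n",
    "## 3 How would you solve the problem above? Briefly describe your approach\n",
    "After building a variety of features from question1 and question2, train a machine learning model and treat the task as a classification problem. See the **Modeling** section (Section 6) later in this Jupyter Notebook. "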
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4 Data Exploration"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19",
    "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5",
    "execution": {
     "iopub.execute_input": "2021-07-10T14:49:57.489167Z",
     "iopub.status.busy": "2021-07-10T14:49:57.488714Z",
     "iopub.status.idle": "2021-07-10T14:49:57.493615Z",
     "shell.execute_reply": "2021-07-10T14:49:57.492493Z",
     "shell.execute_reply.started": "2021-07-10T14:49:57.489131Z"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np # linear algebra\n",
    "import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.1 Dataset Overview\n",
    "1. Overall dimensions;\n",
    "2. Missing values;\n",
    "3. Uniqueness of the ids;\n",
    "4. Duplicated questions, and so on."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T11:53:01.623513Z",
     "iopub.status.busy": "2021-07-10T11:53:01.623219Z",
     "iopub.status.idle": "2021-07-10T11:53:03.430346Z",
     "shell.execute_reply": "2021-07-10T11:53:03.429364Z",
     "shell.execute_reply.started": "2021-07-10T11:53:01.623485Z"
    }
   },
   "outputs": [],
   "source": [
    "df_train=pd.read_csv('train.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T11:53:03.432598Z",
     "iopub.status.busy": "2021-07-10T11:53:03.432278Z",
     "iopub.status.idle": "2021-07-10T11:53:10.083679Z",
     "shell.execute_reply": "2021-07-10T11:53:10.082724Z",
     "shell.execute_reply.started": "2021-07-10T11:53:03.43257Z"
    }
   },
   "outputs": [],
   "source": [
    "df_test0=pd.read_csv('test.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T11:53:10.085968Z",
     "iopub.status.busy": "2021-07-10T11:53:10.085568Z",
     "iopub.status.idle": "2021-07-10T11:53:10.104334Z",
     "shell.execute_reply": "2021-07-10T11:53:10.103414Z",
     "shell.execute_reply.started": "2021-07-10T11:53:10.085927Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>qid1</th>\n",
       "      <th>qid2</th>\n",
       "      <th>question1</th>\n",
       "      <th>question2</th>\n",
       "      <th>is_duplicate</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>What is the step by step guide to invest in sh...</td>\n",
       "      <td>What is the step by step guide to invest in sh...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>4</td>\n",
       "      <td>What is the story of Kohinoor (Koh-i-Noor) Dia...</td>\n",
       "      <td>What would happen if the Indian government sto...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>5</td>\n",
       "      <td>6</td>\n",
       "      <td>How can I increase the speed of my internet co...</td>\n",
       "      <td>How can Internet speed be increased by hacking...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>7</td>\n",
       "      <td>8</td>\n",
       "      <td>Why am I mentally very lonely? How can I solve...</td>\n",
       "      <td>Find the remainder when [math]23^{24}[/math] i...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>9</td>\n",
       "      <td>10</td>\n",
       "      <td>Which one dissolve in water quikly sugar, salt...</td>\n",
       "      <td>Which fish would survive in salt water?</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>404285</th>\n",
       "      <td>404285</td>\n",
       "      <td>433578</td>\n",
       "      <td>379845</td>\n",
       "      <td>How many keywords are there in the Racket prog...</td>\n",
       "      <td>How many keywords are there in PERL Programmin...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>404286</th>\n",
       "      <td>404286</td>\n",
       "      <td>18840</td>\n",
       "      <td>155606</td>\n",
       "      <td>Do you believe there is life after death?</td>\n",
       "      <td>Is it true that there is life after death?</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>404287</th>\n",
       "      <td>404287</td>\n",
       "      <td>537928</td>\n",
       "      <td>537929</td>\n",
       "      <td>What is one coin?</td>\n",
       "      <td>What's this coin?</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>404288</th>\n",
       "      <td>404288</td>\n",
       "      <td>537930</td>\n",
       "      <td>537931</td>\n",
       "      <td>What is the approx annual cost of living while...</td>\n",
       "      <td>I am having little hairfall problem but I want...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>404289</th>\n",
       "      <td>404289</td>\n",
       "      <td>537932</td>\n",
       "      <td>537933</td>\n",
       "      <td>What is like to have sex with cousin?</td>\n",
       "      <td>What is it like to have sex with your cousin?</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>404290 rows × 6 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "            id    qid1    qid2  \\\n",
       "0            0       1       2   \n",
       "1            1       3       4   \n",
       "2            2       5       6   \n",
       "3            3       7       8   \n",
       "4            4       9      10   \n",
       "...        ...     ...     ...   \n",
       "404285  404285  433578  379845   \n",
       "404286  404286   18840  155606   \n",
       "404287  404287  537928  537929   \n",
       "404288  404288  537930  537931   \n",
       "404289  404289  537932  537933   \n",
       "\n",
       "                                                question1  \\\n",
       "0       What is the step by step guide to invest in sh...   \n",
       "1       What is the story of Kohinoor (Koh-i-Noor) Dia...   \n",
       "2       How can I increase the speed of my internet co...   \n",
       "3       Why am I mentally very lonely? How can I solve...   \n",
       "4       Which one dissolve in water quikly sugar, salt...   \n",
       "...                                                   ...   \n",
       "404285  How many keywords are there in the Racket prog...   \n",
       "404286          Do you believe there is life after death?   \n",
       "404287                                  What is one coin?   \n",
       "404288  What is the approx annual cost of living while...   \n",
       "404289              What is like to have sex with cousin?   \n",
       "\n",
       "                                                question2  is_duplicate  \n",
       "0       What is the step by step guide to invest in sh...             0  \n",
       "1       What would happen if the Indian government sto...             0  \n",
       "2       How can Internet speed be increased by hacking...             0  \n",
       "3       Find the remainder when [math]23^{24}[/math] i...             0  \n",
       "4                 Which fish would survive in salt water?             0  \n",
       "...                                                   ...           ...  \n",
       "404285  How many keywords are there in PERL Programmin...             0  \n",
       "404286         Is it true that there is life after death?             1  \n",
       "404287                                  What's this coin?             0  \n",
       "404288  I am having little hairfall problem but I want...             0  \n",
       "404289      What is it like to have sex with your cousin?             0  \n",
       "\n",
       "[404290 rows x 6 columns]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_train"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T11:53:10.106051Z",
     "iopub.status.busy": "2021-07-10T11:53:10.10573Z",
     "iopub.status.idle": "2021-07-10T11:53:10.126879Z",
     "shell.execute_reply": "2021-07-10T11:53:10.126123Z",
     "shell.execute_reply.started": "2021-07-10T11:53:10.105993Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>test_id</th>\n",
       "      <th>question1</th>\n",
       "      <th>question2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>How does the Surface Pro himself 4 compare wit...</td>\n",
       "      <td>Why did Microsoft choose core m3 and not core ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>Should I have a hair transplant at age 24? How...</td>\n",
       "      <td>How much cost does hair transplant require?</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>What but is the best way to send money from Ch...</td>\n",
       "      <td>What you send money to China?</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>Which food not emulsifiers?</td>\n",
       "      <td>What foods fibre?</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>How \"aberystwyth\" start reading?</td>\n",
       "      <td>How their can I start reading?</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2345791</th>\n",
       "      <td>2345791</td>\n",
       "      <td>How do Peaks (TV series): Why did Leland kill ...</td>\n",
       "      <td>What is the most study scene in twin peaks?</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2345792</th>\n",
       "      <td>2345792</td>\n",
       "      <td>What does be \"in transit\" mean on FedEx tracking?</td>\n",
       "      <td>How question FedEx packages delivered?</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2345793</th>\n",
       "      <td>2345793</td>\n",
       "      <td>What are some famous Romanian drinks (alcoholi...</td>\n",
       "      <td>Can a non-alcoholic restaurant be a huge success?</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2345794</th>\n",
       "      <td>2345794</td>\n",
       "      <td>What were the best and worst things about publ...</td>\n",
       "      <td>What are the best and worst things examination...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2345795</th>\n",
       "      <td>2345795</td>\n",
       "      <td>What is the best medication equation erectile ...</td>\n",
       "      <td>How do I out get rid of Erectile Dysfunction?</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>2345796 rows × 3 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "         test_id                                          question1  \\\n",
       "0              0  How does the Surface Pro himself 4 compare wit...   \n",
       "1              1  Should I have a hair transplant at age 24? How...   \n",
       "2              2  What but is the best way to send money from Ch...   \n",
       "3              3                        Which food not emulsifiers?   \n",
       "4              4                   How \"aberystwyth\" start reading?   \n",
       "...          ...                                                ...   \n",
       "2345791  2345791  How do Peaks (TV series): Why did Leland kill ...   \n",
       "2345792  2345792  What does be \"in transit\" mean on FedEx tracking?   \n",
       "2345793  2345793  What are some famous Romanian drinks (alcoholi...   \n",
       "2345794  2345794  What were the best and worst things about publ...   \n",
       "2345795  2345795  What is the best medication equation erectile ...   \n",
       "\n",
       "                                                 question2  \n",
       "0        Why did Microsoft choose core m3 and not core ...  \n",
       "1              How much cost does hair transplant require?  \n",
       "2                            What you send money to China?  \n",
       "3                                        What foods fibre?  \n",
       "4                           How their can I start reading?  \n",
       "...                                                    ...  \n",
       "2345791        What is the most study scene in twin peaks?  \n",
       "2345792             How question FedEx packages delivered?  \n",
       "2345793  Can a non-alcoholic restaurant be a huge success?  \n",
       "2345794  What are the best and worst things examination...  \n",
       "2345795      How do I out get rid of Erectile Dysfunction?  \n",
       "\n",
       "[2345796 rows x 3 columns]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_test0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T11:53:10.128231Z",
     "iopub.status.busy": "2021-07-10T11:53:10.127847Z",
     "iopub.status.idle": "2021-07-10T11:53:10.141535Z",
     "shell.execute_reply": "2021-07-10T11:53:10.140605Z",
     "shell.execute_reply.started": "2021-07-10T11:53:10.128203Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(404290, 6) (2345796, 3)\n"
     ]
    }
   ],
   "source": [
    "print(df_train.shape,df_test0.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-09T14:40:29.674054Z",
     "iopub.status.busy": "2021-07-09T14:40:29.673664Z",
     "iopub.status.idle": "2021-07-09T14:40:29.679898Z",
     "shell.execute_reply": "2021-07-09T14:40:29.678638Z",
     "shell.execute_reply.started": "2021-07-09T14:40:29.674018Z"
    }
   },
   "source": [
    "The test set is about 5.8 times the size of the training set (2,345,796 rows versus 404,290).  \n",
    "Next, check whether there are any missing values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T11:53:10.143319Z",
     "iopub.status.busy": "2021-07-10T11:53:10.143016Z",
     "iopub.status.idle": "2021-07-10T11:53:10.338855Z",
     "shell.execute_reply": "2021-07-10T11:53:10.337918Z",
     "shell.execute_reply.started": "2021-07-10T11:53:10.143291Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "question1    1\n",
       "question2    2\n",
       "dtype: int64"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_train.isnull().sum()[df_train.isnull().sum()>0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T11:53:10.341518Z",
     "iopub.status.busy": "2021-07-10T11:53:10.341217Z",
     "iopub.status.idle": "2021-07-10T11:53:10.402029Z",
     "shell.execute_reply": "2021-07-10T11:53:10.401099Z",
     "shell.execute_reply.started": "2021-07-10T11:53:10.341489Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>qid1</th>\n",
       "      <th>qid2</th>\n",
       "      <th>question1</th>\n",
       "      <th>question2</th>\n",
       "      <th>is_duplicate</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>363362</th>\n",
       "      <td>363362</td>\n",
       "      <td>493340</td>\n",
       "      <td>493341</td>\n",
       "      <td>NaN</td>\n",
       "      <td>My Chinese name is Haichao Yu. What English na...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            id    qid1    qid2 question1  \\\n",
       "363362  363362  493340  493341       NaN   \n",
       "\n",
       "                                                question2  is_duplicate  \n",
       "363362  My Chinese name is Haichao Yu. What English na...             0  "
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_train[df_train['question1'].isnull()==True]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T11:53:10.404111Z",
     "iopub.status.busy": "2021-07-10T11:53:10.403809Z",
     "iopub.status.idle": "2021-07-10T11:53:10.463768Z",
     "shell.execute_reply": "2021-07-10T11:53:10.462872Z",
     "shell.execute_reply.started": "2021-07-10T11:53:10.404082Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>qid1</th>\n",
       "      <th>qid2</th>\n",
       "      <th>question1</th>\n",
       "      <th>question2</th>\n",
       "      <th>is_duplicate</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>105780</th>\n",
       "      <td>105780</td>\n",
       "      <td>174363</td>\n",
       "      <td>174364</td>\n",
       "      <td>How can I develop android app?</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>201841</th>\n",
       "      <td>201841</td>\n",
       "      <td>303951</td>\n",
       "      <td>174364</td>\n",
       "      <td>How can I create an Android app?</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            id    qid1    qid2                         question1 question2  \\\n",
       "105780  105780  174363  174364    How can I develop android app?       NaN   \n",
       "201841  201841  303951  174364  How can I create an Android app?       NaN   \n",
       "\n",
       "        is_duplicate  \n",
       "105780             0  \n",
       "201841             0  "
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_train[df_train['question2'].isnull()==True]"
   ]
  },
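  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Observations:\n",
    "1. No row in the training set has both question1 and question2 missing.\n",
    "2. Whenever one of the two questions is missing, is_duplicate is 0, which is as expected.\n",
    "\n",
    "Rows with missing values need to be removed, otherwise they will interfere with later processing.  \n",
    "The next cell drops the rows that contain missing values."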
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T11:53:10.46539Z",
     "iopub.status.busy": "2021-07-10T11:53:10.465088Z",
     "iopub.status.idle": "2021-07-10T11:53:10.626984Z",
     "shell.execute_reply": "2021-07-10T11:53:10.626127Z",
     "shell.execute_reply.started": "2021-07-10T11:53:10.46536Z"
    }
   },
   "outputs": [],
   "source": [
    "# Drop rows where either question is missing\n",
    "df_train = df_train.dropna(subset=['question1', 'question2'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Likewise, check the test set for missing values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T11:53:10.629169Z",
     "iopub.status.busy": "2021-07-10T11:53:10.628441Z",
     "iopub.status.idle": "2021-07-10T11:53:11.597124Z",
     "shell.execute_reply": "2021-07-10T11:53:11.596166Z",
     "shell.execute_reply.started": "2021-07-10T11:53:10.629123Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "question1    2\n",
       "question2    4\n",
       "dtype: int64"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_test0.isnull().sum()[df_test0.isnull().sum()>0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T11:53:11.599004Z",
     "iopub.status.busy": "2021-07-10T11:53:11.598593Z",
     "iopub.status.idle": "2021-07-10T11:53:11.857461Z",
     "shell.execute_reply": "2021-07-10T11:53:11.856437Z",
     "shell.execute_reply.started": "2021-07-10T11:53:11.598963Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>test_id</th>\n",
       "      <th>question1</th>\n",
       "      <th>question2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1046690</th>\n",
       "      <td>1046690</td>\n",
       "      <td>NaN</td>\n",
       "      <td>How I what can learn android app development?</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1461432</th>\n",
       "      <td>1461432</td>\n",
       "      <td>NaN</td>\n",
       "      <td>How distinct can learn android app development?</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         test_id question1                                        question2\n",
       "1046690  1046690       NaN    How I what can learn android app development?\n",
       "1461432  1461432       NaN  How distinct can learn android app development?"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_test0[df_test0['question1'].isnull()==True]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T11:53:11.858891Z",
     "iopub.status.busy": "2021-07-10T11:53:11.858562Z",
     "iopub.status.idle": "2021-07-10T11:53:12.107203Z",
     "shell.execute_reply": "2021-07-10T11:53:12.106039Z",
     "shell.execute_reply.started": "2021-07-10T11:53:11.85886Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>test_id</th>\n",
       "      <th>question1</th>\n",
       "      <th>question2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>379205</th>\n",
       "      <td>379205</td>\n",
       "      <td>How I can learn android app development?</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>817520</th>\n",
       "      <td>817520</td>\n",
       "      <td>How real can learn android app development?</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>943911</th>\n",
       "      <td>943911</td>\n",
       "      <td>How app development?</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1270024</th>\n",
       "      <td>1270024</td>\n",
       "      <td>How I can learn app development?</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         test_id                                    question1 question2\n",
       "379205    379205     How I can learn android app development?       NaN\n",
       "817520    817520  How real can learn android app development?       NaN\n",
       "943911    943911                         How app development?       NaN\n",
       "1270024  1270024             How I can learn app development?       NaN"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_test0[df_test0['question2'].isnull()==True]"
   ]
  },
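  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The missing values in the test set resemble those in the training set, so we drop these rows as well. The dropped rows are kept in df_test1."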
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T12:20:15.052754Z",
     "iopub.status.busy": "2021-07-10T12:20:15.052346Z",
     "iopub.status.idle": "2021-07-10T12:20:16.93219Z",
     "shell.execute_reply": "2021-07-10T12:20:16.931015Z",
     "shell.execute_reply.started": "2021-07-10T12:20:15.052719Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Series([], dtype: int64)"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Keep only the rows where both questions are present\n",
    "df_test = df_test0.dropna(subset=['question1', 'question2'])\n",
    "# Confirm that no column still contains missing values (expect an empty Series)\n",
    "df_test.isnull().sum()[df_test.isnull().sum() > 0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T11:53:13.941056Z",
     "iopub.status.busy": "2021-07-10T11:53:13.940475Z",
     "iopub.status.idle": "2021-07-10T11:53:14.417174Z",
     "shell.execute_reply": "2021-07-10T11:53:14.416405Z",
     "shell.execute_reply.started": "2021-07-10T11:53:13.941011Z"
    }
   },
   "outputs": [],
   "source": [
    "df_test1=pd.concat([df_test0[df_test0['question1'].isnull()],df_test0[df_test0['question2'].isnull()]])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T11:53:14.419102Z",
     "iopub.status.busy": "2021-07-10T11:53:14.418472Z",
     "iopub.status.idle": "2021-07-10T11:53:14.430954Z",
     "shell.execute_reply": "2021-07-10T11:53:14.43001Z",
     "shell.execute_reply.started": "2021-07-10T11:53:14.419057Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>test_id</th>\n",
       "      <th>question1</th>\n",
       "      <th>question2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1046690</th>\n",
       "      <td>1046690</td>\n",
       "      <td>NaN</td>\n",
       "      <td>How I what can learn android app development?</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1461432</th>\n",
       "      <td>1461432</td>\n",
       "      <td>NaN</td>\n",
       "      <td>How distinct can learn android app development?</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>379205</th>\n",
       "      <td>379205</td>\n",
       "      <td>How I can learn android app development?</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>817520</th>\n",
       "      <td>817520</td>\n",
       "      <td>How real can learn android app development?</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>943911</th>\n",
       "      <td>943911</td>\n",
       "      <td>How app development?</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1270024</th>\n",
       "      <td>1270024</td>\n",
       "      <td>How I can learn app development?</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         test_id                                    question1  \\\n",
       "1046690  1046690                                          NaN   \n",
       "1461432  1461432                                          NaN   \n",
       "379205    379205     How I can learn android app development?   \n",
       "817520    817520  How real can learn android app development?   \n",
       "943911    943911                         How app development?   \n",
       "1270024  1270024             How I can learn app development?   \n",
       "\n",
       "                                               question2  \n",
       "1046690    How I what can learn android app development?  \n",
       "1461432  How distinct can learn android app development?  \n",
       "379205                                               NaN  \n",
       "817520                                               NaN  \n",
       "943911                                               NaN  \n",
       "1270024                                              NaN  "
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_test1"
   ]
  },
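  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, let's check whether any questions are repeated. We first pool question1 and question2 into a single column and then count how often each question occurs."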
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T11:53:14.432393Z",
     "iopub.status.busy": "2021-07-10T11:53:14.432101Z",
     "iopub.status.idle": "2021-07-10T11:53:14.572826Z",
     "shell.execute_reply": "2021-07-10T11:53:14.571883Z",
     "shell.execute_reply.started": "2021-07-10T11:53:14.432366Z"
    }
   },
   "outputs": [],
   "source": [
    "df_ques=pd.DataFrame(df_train['question1'].to_list()+df_train['question2'].to_list(),columns=['question'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T11:53:14.576514Z",
     "iopub.status.busy": "2021-07-10T11:53:14.576227Z",
     "iopub.status.idle": "2021-07-10T11:53:14.590726Z",
     "shell.execute_reply": "2021-07-10T11:53:14.58976Z",
     "shell.execute_reply.started": "2021-07-10T11:53:14.576487Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>question</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>What is the step by step guide to invest in sh...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>What is the story of Kohinoor (Koh-i-Noor) Dia...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>How can I increase the speed of my internet co...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Why am I mentally very lonely? How can I solve...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Which one dissolve in water quikly sugar, salt...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>808569</th>\n",
       "      <td>How many keywords are there in PERL Programmin...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>808570</th>\n",
       "      <td>Is it true that there is life after death?</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>808571</th>\n",
       "      <td>What's this coin?</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>808572</th>\n",
       "      <td>I am having little hairfall problem but I want...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>808573</th>\n",
       "      <td>What is it like to have sex with your cousin?</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>808574 rows × 1 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                 question\n",
       "0       What is the step by step guide to invest in sh...\n",
       "1       What is the story of Kohinoor (Koh-i-Noor) Dia...\n",
       "2       How can I increase the speed of my internet co...\n",
       "3       Why am I mentally very lonely? How can I solve...\n",
       "4       Which one dissolve in water quikly sugar, salt...\n",
       "...                                                   ...\n",
       "808569  How many keywords are there in PERL Programmin...\n",
       "808570         Is it true that there is life after death?\n",
       "808571                                  What's this coin?\n",
       "808572  I am having little hairfall problem but I want...\n",
       "808573      What is it like to have sex with your cousin?\n",
       "\n",
       "[808574 rows x 1 columns]"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_ques"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T11:53:14.592298Z",
     "iopub.status.busy": "2021-07-10T11:53:14.591976Z",
     "iopub.status.idle": "2021-07-10T11:53:16.330306Z",
     "shell.execute_reply": "2021-07-10T11:53:16.329351Z",
     "shell.execute_reply.started": "2021-07-10T11:53:14.592271Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "What are the best ways to lose weight?                                                                                       161\n",
       "How can you look at someone's private Instagram account without following them?                                              120\n",
       "How can I lose weight quickly?                                                                                               111\n",
       "What's the easiest way to make money online?                                                                                  88\n",
       "Can you see who views your Instagram?                                                                                         79\n",
       "                                                                                                                            ... \n",
       "Should poor people get less jail time than the rich or vice versa?                                                             2\n",
       "What do you love about Portugal?                                                                                               2\n",
       "Why do straight women who put on makeup and dress to attract attention consider it disrespectful when they get attention?      2\n",
       "What is the correct answer 90/2+1-2*5?                                                                                         2\n",
       "Why do Russians drink so much Vodka?                                                                                           2\n",
       "Name: question, Length: 111870, dtype: int64"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_ques.question.value_counts()[df_ques.question.value_counts()>1]"
   ]
  },
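  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Many questions appear more than once (111,870 distinct questions occur at least twice). The histogram below visualizes how often questions are repeated:"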
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T11:53:16.331821Z",
     "iopub.status.busy": "2021-07-10T11:53:16.33153Z",
     "iopub.status.idle": "2021-07-10T11:53:18.083493Z",
     "shell.execute_reply": "2021-07-10T11:53:18.082565Z",
     "shell.execute_reply.started": "2021-07-10T11:53:16.331795Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "<ipython-input-20-c365111620ec>:4: MatplotlibDeprecationWarning: The 'nonposy' parameter of __init__() has been renamed 'nonpositive' since Matplotlib 3.3; support for the old name will be dropped two minor releases later.\n",
      "  plt.yscale('log', nonposy='clip')\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "Text(0, 0.5, 'Number of questions')"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAtMAAAFNCAYAAADCcOOfAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8QVMy6AAAACXBIWXMAAAsTAAALEwEAmpwYAAAsf0lEQVR4nO3deZhkZXn38e+PXREHFdyAcdBBFJcYbXFNREUFYcAlvkKMUYJMMMElUeNojOKSiFETTUTNCIhxgRBfF0YwuItxZRGURXTEUUYQUHRAfEWR+/3jnMaapqu7qrprqor5fq6rrqnznO0+T1VN3/XUfc5JVSFJkiSpf1uMOgBJkiRpUplMS5IkSQMymZYkSZIGZDItSZIkDchkWpIkSRqQybQkSZI0IJNpSYsqyR8luWTUcYxSGu9N8vMk3xh1PN0keXeSfxh1HJI0yUympVuJJOuS7DvkfeyTZP0s7V9I8jyAqvpSVe3Zw7aOTvKBYcQ5Bh4NPAHYtar2HnUwAEmem+R/O9uq6siqev2oYtLoJKkky0cdh3RrYDIt6VYnyVYjDuEewLqqun7EcaiLMXiPSLqVMJmWbuWSbJvkbUkubx9vS7Jtx/y/S3JFO+95Cx2xmjl6neTlSX6c5LoklyR5fJL9gFcCz0zyyyTnt8vePcmpSa5JsjbJER3buU2S97WlExe3cXfuZ127r28B1yfZKsmqJN9v931Rkqd2LP/cJF9O8q9JfpHk0iSPbNsvS3JVkufMcZyzxprkcOA44BHtsb12lnW3TPKWJD9t9/vXbb9v1XEs+3Ysv9EofpKHJ/lKG/f5SfaZcVyXtsf8gyTPSnJf4N0dMf2iXfbEJG/oWPeI9liuaY/t7h3zKsmRSb7XvgbHJkmXvtk7yVfb+K5I8o4k28zY1gvbOH+a5M1Jtpjxuvx7kg1JvpPk8R3rLklyfLvdHyd5Q5It23n3SvK5JD9rt/vBJDsu8D3yv+1r9fO2P/fvmH/HNOU8l7fzP9Yx78Ak57V98JUkD5ytr9pl75fk022/X5nklW17189uZvmlIR2f3fa1PTbJae2xfT3Jvdp5Z7arnN++H56ZZKckn2jjvSbJl6ZfE0nzqCofPnzcCh7AOmDfWdpfB3wNuDOwM/AV4PXtvP2AnwD3A24LvB8oYHmXfewDrJ+l/QvA82YuA+wJXAbcvZ1eBtyrfX408IEZ2/ki8E5gO+BBwNXA49t5x7Tz7wDsCnyrM5b2+M8DdgNu07Y9A7g7zcDBM4Hrgbu1854L3AgcBmwJvAH4EXAssC3wROA64HZd+mKuWJ8L/O8cr9WRwHfaWO8IfL7t961mey07+wrYBfgZ8OT2uJ7QTu8MbA9cC+zZLns34H7dYgJOBN7QPn8c8FPgwe3x/ztwZseyBXwC2BFY2h7vfl2O7yHAw4Gt2tf8YuDFM7b1+fbYlwLf5ffvn+nX5W+ArdvXbQNwx3b+x4D/aI/1zsA3gL9s5y1v+2Pbtj/OBN62wPfIb4EjaN4jzwcuB9LOPw34L5r35NbAY9r2BwNXAQ9r13tOu+9tZ+mrHYArgJfQvJd2AB7Ww2d3ttfz5s9u+9peA+zdvg4fBE6ebdl2+o00X7i2bh9/NH2cPnz4mPsx8gB8+PCxOA+6J9PfB57cMf0kmhIEgBOAN3bMWz7zj+yMbe0D3AT8YsbjRmZPppe3ScW+wNYztnU0Hcl0m+D8Dtiho+2NwInt80uBJ3XMex63TKb/Yp4+Og84uH3+XOB7HfMe0B77XTrafgY8aJbtzBfrLRKdGet/DjiyY/qJ9J5Mvxx4/4ztnUGTsG3fvh5Pp00WO5a5RUxsnEwfD/xzx7zb0SSSy9rpAh7dMf8UYFWP780XAx/tmC46EnHgr4DPdsR5c8Latn0DeDZwF+CGzmMDDgU+32W/TwG+ucD3yNqOebdtY78rzReVm4A7zLKNd9EmvR1tl9Am2zPaD+2MsY/P7myv58xk
+riOeU8GvjPbsu3064CP0+Wz78OHj+4Pf8KRbv3uDvywY/qHbdv0vMs65t38PMnS9ifgXyb5Zccyl1fVjp0PYKOfm6dV1VqaROpo4KokJ3eWDswS5zVVdd2MWHeZL9ZubUn+vOOn9l8A9wd26ljkyo7n/6+NeWbb7QaIdT4zj+WH3RacxT2AZ0wfU3tcj6YZTb2eZnT1SOCK9if++/QR081xVNUvab5MdB7TTzqe/4rZ+4Yk925LBn6S5Frgn9i43+GWx9/5vvhxVdUs8+9BM2p6Rcex/wfNyC1J7ty+x37c7vcD8+y3l/fIzcdcVb9qn96O5gvVNVX181m64B7AS2a8RrvNOMZpu9EkzbOZ67Pbi55er9abgbXAp9rym1V97EfarJlMS7d+l9P8cZ+2tG2D5uflXTvm7Tb9pKp+VFW3m34MuvOq+lBVPbqNoYA3Tc+aJc47JtlhRqw/ni/Wzt1NP0lyD+A9wFHAndqk/wJg1jrfPs0X63yuYOP4l86Yfz3NKOi0u3Y8v4xmZLrzC832VXUMQFWdUVVPoBk5/Q5NH8At+3umjd4nSbYH7tTHMXV6V7vvParq9jT18TP7febxX94xvcuMeuzp+ZfRjEzv1HHst6+q+7XLvZHmOB/Y7vfPZtnvYr1HLqN5D+zYZd4/zniNbltVJ3VZ9l5d9jHXZ3ej90iSzvdI36rquqp6SVXdE1gB/G1nrbqk7kympVuXrZNs1/HYCjgJeFWSnZPsBLyaZsQOmp/qD0ty3yS3bectmiR7Jnlce9LUr2lGen/Xzr4SWDZ9klNVXUZTE/rGNvYHAofT1HpOx/qKJHdIsgtNAjSX7WkSp6vbWA6jGXVcsB5inc8pwAuT7JrkDsDMUcDzgEOSbJ1kCviTjnkfAFYkeVKaExm3S3PS565J7pLkoDYRvgH4JRv3967pOBFwhg/RvBce1L5e/wR8varW9XhMnXagqd3+ZTsy/vxZlnlZ+1ruBryIpvZ42p1p+mfrJM8A7gucXlVXAJ8C3prk9km2SHPS4WM69vtL4Bfte+Rl88Q58HukjeWTwDvb49g6yR+3s98DHJnkYWlsn+SAGV++pn0CuGuSF6c54XCHJA9r58312T0fuF/7em1H8+tPP64E7jk9keaEyeXtl5hrad43v+u2sqTfM5mWbl1Op0lYpx9H05xYdzbNCXvfBs5t26iqTwL/RnMy2Frgq+12blikeLalOXHwpzQ/Od+ZZpQS4L/bf3+W5Nz2+aE0J6xdDnwUeE1Vfbqd9zpgPfAD4DPAh+eKs6ouAt5Kc0xX0tREf3kxDqqHWOfzHpo65/NpXo+PzJj/DzSjlT8HXkuT6AI3J/IH0/Tj1TQjmy+j+f98C5oT2S6nOfnsMTT1yNDUaV8I/CTJT2cGVFWfbff7f2lGzu8FHNLj8cz0UuBPaU7gfA8bJ8rTPg6cQ/PF4TSamu1pXwf2oHnf/CPwJ1X1s3benwPbABfR9M+HaUbhoemrB9OcsHgat+zXjSzCe+TZNHXl36E5N+DF7XbPpjlp8R1tjGtpapxni+E6mpMmV9B8Rr4HPLadPddn97s0n4nPtOvMWmo1h6OB97VlKP+Hpr8/Q/Nl5KvAO6vqC31uU9osTZ+RLEmkuYTaBTRXHbhx1PHMJcnzgUOq6jHzLjzmkiyj+ZKw9bj3+2JIUjQlIGtnmfdcmpNZH73JA5OkATgyLW3mkjw1yTZtucGbgDXjmNAluVuSR7U/7e9JMwL70VHHJUnavI1NMt3+gfzHNBfqf86o45E2I39JUy7wfZoaydnqW8fBNjRXbriOpmTh4zTXeZYkaWSGWuaR5ATgQOCqqrp/R/t+wNtpLmZ/XFUdk+auUwfT1Pmd1tbvSZIkSWNr2CPTJ9LcYe1maW77eiywP7AXcGiSvWjulPbVqvpbxndkTJIkSbrZUJPpqjqTZqS50940d5S6tKp+A5xMMyK9nuasZ/ByPJIkSZoAW41gn7uw8R2o1gMPoyn7
+PckfwSc2W3lJCuBlQDbb7/9Q+5zn15v7iVJkiQN5pxzzvlpVe08s30UyfRsd5aq9jath8+3clWtBlYDTE1N1dlnn73I4UmSJEkbS/LD2dpHcTWP9Wx8G9ld2fg2svNKsiLJ6g0bNixqYJIkSVI/RpFMnwXskWT39ra2hwCn9rOBqlpTVSuXLFkylAAlSZKkXgw1mU5yEs1tSfdMsj7J4e3NII6iuZXuxcApVXVhn9t1ZFqSJEkjN9G3E7dmWpIkSZtCknOqampm+9jcAVGSJEmaNBOZTFvmIUmSpHEwkcm0JyBKkiRpHExkMi1JkiSNg4lMpi3zkCRJ0jiYyGTaMg9JkiSNg1HcTnziLVt1Wt/rrDvmgCFEIkmSpFGayJFpyzwkSZI0DiYymbbMQ5IkSeNgIpNpSZIkaRyYTEuSJEkDmshk2pppSZIkjYOJTKatmZYkSdI4mMhkWpIkSRoHJtOSJEnSgEymJUmSpAGZTEuSJEkDmshk2qt5SJIkaRxMZDLt1TwkSZI0DiYymZYkSZLGgcm0JEmSNCCTaUmSJGlAJtOSJEnSgEymJUmSpAFNZDLtpfEkSZI0DiYymfbSeJIkSRoHE5lMS5IkSePAZFqSJEkakMm0JEmSNCCTaUmSJGlAJtOSJEnSgEymJUmSpAGZTEuSJEkDGptkOsk+Sb6U5N1J9hl1PJIkSdJ8hppMJzkhyVVJLpjRvl+SS5KsTbKqbS7gl8B2wPphxiVJkiQthmGPTJ8I7NfZkGRL4Fhgf2Av4NAkewFfqqr9gZcDrx1yXJIkSdKCDTWZrqozgWtmNO8NrK2qS6vqN8DJwMFVdVM7/+fAtsOMS5IkSVoMW41gn7sAl3VMrwceluRpwJOAHYF3dFs5yUpgJcDSpUuHF6UkSZI0j1Ek05mlrarqI8BH5lu5qlYDqwGmpqZqkWOTJEmSejaKq3msB3brmN4VuLyfDSRZkWT1hg0bFjUwSZIkqR+jSKbPAvZIsnuSbYBDgFP72UBVramqlUuWLBlKgJIkSVIvhn1pvJOArwJ7Jlmf5PCquhE4CjgDuBg4paouHGYckiRJ0jAMtWa6qg7t0n46cPqg202yAlixfPnyQTchSZIkLdjY3AGxH5Z5SJIkaRxMZDLtCYiSJEkaBxOZTDsyLUmSpHEwkcm0JEmSNA4mMpm2zEOSJEnjYCKTacs8JEmSNA4mMpmWJEmSxsFEJtOWeUiSJGkcTGQybZmHJEmSxsFEJtOSJEnSODCZliRJkgZkMi1JkiQNaCKTaU9AlCRJ0jiYyGTaExAlSZI0DiYymZYkSZLGgcm0JEmSNCCTaUmSJGlAE5lMewKiJEmSxsFEJtOegChJkqRxMJHJtCRJkjQOTKYlSZKkAZlMS5IkSQMymZYkSZIGZDItSZIkDchkWpIkSRrQRCbTXmdakiRJ42Aik2mvMy1JkqRxMJHJtCRJkjQOTKYlSZKkAZlMS5IkSQMymZYkSZIGZDItSZIkDchkWpIkSRqQybQkSZI0oLFKppNsn+ScJAeOOhZJkiRpPkNNppOckOSqJBfMaN8vySVJ1iZZ1THr5cApw4xJkiRJWizzJtNJXpTk9mkcn+TcJE/scfsnAvvN2N6WwLHA/sBewKFJ9kqyL3ARcGVfRyBJkiSNyFY9LPMXVfX2JE8CdgYOA94LfGq+FavqzCTLZjTvDaytqksBkpwMHAzcDtieJsH+f0lOr6qbej4SSZIkaRPrJZlO+++TgfdW1flJMtcK89gFuKxjej3wsKo6CiDJc4Gfdkukk6wEVgIsXbp0AWFsWstWndbX8uuOOWBIkUiSJGmx9FIzfU6ST9Ek02ck2QFYyIjxbIl43fyk6sSq+kS3latqdVVNVdXUzjvvvIAwJEmSpIXpZWT6cOBBwKVV9askd6Ip9RjUemC3juldgcv72UCSFcCK5cuXLyAMSZIkaWHmHZluyy2uBPZK8sfA/YAdF7DPs4A9kuyeZBvgEODUfjZQVWuq
auWSJUsWEIYkSZK0MPOOTCd5E/BMmitt/K5tLuDMHtY9CdgH2CnJeuA1VXV8kqOAM4AtgROq6sJ+gnZkWpIkSeOglzKPpwB7VtUN/W68qg7t0n46cHq/2+tYfw2wZmpq6ohBtyFJkiQtVC8nIF4KbD3sQPqRZEWS1Rs2bBh1KJIkSdqM9TIy/SvgvCSfBW4ena6qFw4tqnk4Mi1JkqRx0EsyfSp9niAoSZIkbQ7mTaar6n3tVTfu3TZdUlW/HW5Yc/MEREmSJI2DeWumk+wDfA84Fngn8N32Enkj46XxJEmSNA56KfN4K/DEqroEIMm9gZOAhwwzMEmSJGnc9XI1j62nE2mAqvouY3Z1D0mSJGkUehmZPjvJ8cD72+lnAecML6T5WTMtSZKkcdDLyPTzgQuBFwIvorkT4pHDDGo+1kxLkiRpHPRyNY8bgH9pH5IkSZJaXZPpJKdU1f9J8m2gZs6vqgcONTJJkiRpzM01Mv2i9t8DN0Ug/bBmWpIkSeOga810VV3RPv2rqvph5wP4q00TXtfYrJmWJEnSyPVyAuITZmnbf7EDkSRJkibNXDXTz6cZgb5Xkm91zNoB+PKwA5MkSZLG3Vw10x8CPgm8EVjV0X5dVV0z1KgkSZKkCTBXzfSGqloHvAr4SVsrvTvwZ0l23DThzS7JiiSrN2zYMMowJEmStJlL1S2uerfxAsl5wBSwDDgDOBXYs6qePOzg5jM1NVVnn332Jt/vslWnbfJ9zmfdMQeMOgRJkqRbrSTnVNXUzPZeTkC8qapuBJ4GvK2q/ga422IHKEmSJE2aXpLp3yY5FPhz4BNt29bDC0mSJEmaDL0k04cBjwD+sap+kGR34APDDUuSJEkaf3NdzQOAqrooycuBpe30D4Bjhh2YJEmSNO7mHZlub919HvA/7fSDkpw65LgkSZKksddLmcfRwN7ALwCq6jyaS+RJkiRJm7Vekukbq2rmBZ3nvp7ekHmdaUmSJI2DXpLpC5L8KbBlkj2S/DvwlSHHNaeqWlNVK5csWTLKMCRJkrSZ6yWZfgFwP+AG4CTgWuDFQ4xJkiRJmgi9XM3jV8Dftw9JkiRJrXmT6SSfZ5Ya6ap63FAikiRJkibEvMk08NKO59sBTwduHE44kiRJ0uTopczjnBlNX07yxSHFowEtW3VaX8uvO+aAIUUiSZK0+eilzOOOHZNbAA8B7jq0iCRJkqQJ0UuZxzk0NdOhKe/4AXD4MIOSJEmSJkEvZR6b5G6HSe4LvAjYCfhsVb1rU+xXkiRJGlQvZR5Pm2t+VX1kjnVPAA4Erqqq+3e07we8HdgSOK6qjqmqi4Ejk2wBvKfH+CVJkqSR6aXM43DgkcDn2unHAl8ANtCUf3RNpoETgXcA/zndkGRL4FjgCcB64Kwkp1bVRUkOAla160iSJEljrZdkuoC9quoKgCR3A46tqsPmXbHqzCTLZjTvDaytqkvb7Z0MHAxcVFWnAqcmOQ34UO+HIUmSJG16vSTTy6YT6daVwL0XsM9dgMs6ptcDD0uyD/A0YFvg9G4rJ1kJrARYunTpAsKQJEmSFqaXZPoLSc4ATqIZpT4E+PwC9plZ2qqqvkBTPjKnqloNrAaYmpq6xZ0ZJUmSpE2ll6t5HJXkqcAft02rq+qjC9jnemC3juldgcv72UCSFcCK5cuXLyAMSZIkaWF6GZmmTZ4XkkB3OgvYI8nuwI9pRrr/tJ8NVNUaYM3U1NQRixSTJEmS1LeekulBJTkJ2AfYKcl64DVVdXySo4AzaC6Nd0JVXdjndh2ZXiBvPy5JkrRwQ02mq+rQLu2nM8dJhj1s15FpSZIkjdwW3WYk+Wz775s2XTi9SbIiyeoNGzaMOhRJkiRtxuYamb5bkscAB7XXgt7oKhxVde5QI5uDI9ObXr9lIWBpiCRJuvWbK5l+Nc3dCHcF/mXGvAIeN6ygJEmSpEnQNZmuqg8DH07yD1X1+k0Y07w8AVGSJEnjoGvN9LSqen2Sg5K8pX0c
uCkCmyemNVW1csmSJaMORZIkSZuxeZPpJG8EXgRc1D5e1LZJkiRJm7VeLo13APCgqroJIMn7gG8CrxhmYJIkSdK4m3dkurVjx/OR11Z4aTxJkiSNg16S6TcC30xyYjsqfQ7wT8MNa27WTEuSJGkczFvmUVUnJfkC8FCaa02/vKp+MuzAJEmSpHHX0+3Eq+oK4NQhxyJJkiRNlF5rpseKNdOSJEkaBxOZTFszLUmSpHEwZzKdZIskF2yqYCRJkqRJMmcy3V5b+vwkSzdRPJIkSdLE6OUExLsBFyb5BnD9dGNVHTS0qHSrsGzVaX0tv+6YA4YUiSRJ0nD0kky/duhR9CnJCmDF8uXLRx2KJEmSNmPznoBYVV8E1gFbt8/PAs4dclzzxeQJiJIkSRq5eZPpJEcAHwb+o23aBfjYEGOSJEmSJkIvl8b7a+BRwLUAVfU94M7DDEqSJEmaBL0k0zdU1W+mJ5JsBdTwQpIkSZImQy/J9BeTvBK4TZInAP8NrBluWJIkSdL46+VqHquAw4FvA38JnA4cN8ygtHnyUnqSJGnSzJtMV9VNSd4HfJ2mvOOSqrLMQ5IkSZu9Xq7mcQDwfeDfgHcAa5PsP+zA5olpRZLVGzZsGGUYkiRJ2sz1UjP9VuCxVbVPVT0GeCzwr8MNa25eZ1qSJEnjoJdk+qqqWtsxfSlw1ZDikSRJkiZG15rpJE9rn16Y5HTgFJqa6WfQ3AVRkiRJ2qzNdQLiio7nVwKPaZ9fDdxhaBFJkiRJE6JrMl1Vh23KQCRJkqRJM++l8ZLsDrwAWNa5fFUdNLywpPl5XWpJkjRqvdy05WPA8TR3PbxpqNFIkiRJE6SXZPrXVfVvQ49EkiRJmjC9JNNvT/Ia4FPADdONVXXuYgeT5CnAAcCdgWOr6lOLvQ9JkiRpsfSSTD8AeDbwOH5f5lHt9LySnAAcSHO96vt3tO8HvB3YEjiuqo6pqo8BH0tyB+AtNAm8JEmSNJZ6SaafCtyzqn4z4D5OpLkN+X9ONyTZEjgWeAKwHjgryalVdVG7yKva+ZIkSdLY6uUOiOcDOw66g6o6E7hmRvPewNqqurRN0k8GDk7jTcAnh1FGIkmSJC2mXkam7wJ8J8lZbFwzvZBL4+0CXNYxvR54GM0l+PYFliRZXlXvnrlikpXASoClS5cuIARtbryUniRJWmy9JNOvGcJ+M0tbtVcNmfPKIVW1GlgNMDU1VUOITZIkSerJvMl0VX1xCPtdD+zWMb0rcHmvKydZAaxYvnz5Yscl3cyRbEmSNJ95a6aTXJfk2vbx6yS/S3LtAvd7FrBHkt2TbAMcApza68pVtaaqVi5ZsmSBYUiSJEmDmzeZrqodqur27WM74Ok0V+foSZKTgK8CeyZZn+TwqroROAo4A7gYOKWqLuxjmyuSrN6wYUOvq0iSJEmLrpea6Y1U1ceSrOpj+UO7tJ8OnN7v/tt11wBrpqamjhhkfUmSJGkxzJtMJ3lax+QWwBTNTVskSZKkzVovI9MrOp7fCKwDDh5KND3yBERJkiSNg16u5nHYpgikH5Z5SJIkaRx0TaaTvHqO9aqqXj+EeCRJkqSJMdfVPK6f5QFwOPDyIcc1J6/mIUmSpHHQNZmuqrdOP2juOHgb4DDgZOCemyi+brF5nWlJkiSN3Jw100nuCPwt8CzgfcCDq+rnmyIwSZIkadx1HZlO8maaOxVeBzygqo4el0TaMg9JkiSNg7lqpl8C3B14FXB5xy3Fr1uE24kviGUekiRJGgddyzyqat5bjUtamGWrTutr+XXHHDCkSCRJ0iBMmCVJkqQBTWQybc20JEmSxsFEJtPWTEuSJGkcTGQyLUmSJI0Dk2lJkiRpQHPetEVS7/q9MockSZp8jkxLkiRJA5rIZNqreUiSJGkcTGQy7dU8JEmSNA6smZYmiHdMlCRpvEzkyLQkSZI0DhyZlnQzR74lSeqPI9OSJEnSgEymJUmSpAFNZDLtpfEkSZI0DiYymfbSeJIk
SRoHE5lMS5IkSePAZFqSJEkakJfGk27F+r3UnSRJ6o8j05IkSdKATKYlSZKkAZlMS5IkSQMymZYkSZIGNDbJdJJ7Jjk+yYdHHYskSZLUi6Em00lOSHJVkgtmtO+X5JIka5OsAqiqS6vq8GHGI0mSJC2mYY9Mnwjs19mQZEvgWGB/YC/g0CR7DTkOSZIkadENNZmuqjOBa2Y07w2sbUeifwOcDBw8zDgkSZKkYRhFzfQuwGUd0+uBXZLcKcm7gT9M8opuKydZmeTsJGdfffXVw45VkiRJ6moUd0DMLG1VVT8Djpxv5apaDawGmJqaqkWOTZIkSerZKJLp9cBuHdO7Apf3s4EkK4AVy5cvX8y4JPWp39uVrzvmgCFFIknSaIyizOMsYI8kuyfZBjgEOLWfDVTVmqpauWTJkqEEKEmSJPVi2JfGOwn4KrBnkvVJDq+qG4GjgDOAi4FTqurCPre7IsnqDRs2LH7QkiRJUo+GWuZRVYd2aT8dOH0B210DrJmamjpi0G1IkiRJCzU2d0CUJEmSJs1EJtOWeUiSJGkcTGQy7QmIkiRJGgcTmUxLkiRJ42Aik2nLPCRJkjQOJjKZtsxDkiRJ42Aik2lJkiRpHExkMm2ZhyRJksbBRCbTlnlIkiRpHExkMi1JkiSNA5NpSZIkaUATmUxbMy1JkqRxMJHJtDXTkiRJGgcTmUxLkiRJ48BkWpIkSRqQybQkSZI0IJNpSZIkaUBbjTqAQSRZAaxYvnz5qEOR1Idlq07re511xxwwhEgkSVocEzky7dU8JEmSNA4mMpmWJEmSxoHJtCRJkjQgk2lJkiRpQCbTkiRJ0oBMpiVJkqQBeWk8SZu1fi/X56X6JEmdJnJk2kvjSZIkaRxMZDItSZIkjQOTaUmSJGlAJtOSJEnSgEymJUmSpAGZTEuSJEkDMpmWJEmSBmQyLUmSJA3IZFqSJEka0NjcATHJ9sA7gd8AX6iqD444JEmSJGlOQx2ZTnJCkquSXDCjfb8klyRZm2RV2/w04MNVdQRw0DDjkiRJkhbDsMs8TgT262xIsiVwLLA/sBdwaJK9gF2By9rFfjfkuCRJkqQFG2qZR1WdmWTZjOa9gbVVdSlAkpOBg4H1NAn1ecyR5CdZCawEWLp06eIHLWmsLFt12qhD2KQ2xfGuO+aAoW6/32MYdjy3BoO8L3ydNQluDe+jUZyAuAu/H4GGJoneBfgI8PQk7wLWdFu5qlZX1VRVTe28887DjVSSJEmawyhOQMwsbVVV1wOH9bSBZAWwYvny5YsamCRJktSPUYxMrwd265jeFbi8nw1U1ZqqWrlkyZJFDUySJEnqxyiS6bOAPZLsnmQb4BDg1BHEIUmSJC3IsC+NdxLwVWDPJOuTHF5VNwJHAWcAFwOnVNWFfW53RZLVGzZsWPygJUmSpB4N+2oeh3ZpPx04fQHbXQOsmZqaOmLQbUiSJEkLNZG3E3dkWpIkSeNgIpNpT0CUJEnSOJjIZFqSJEkaBxOZTFvmIUmSpHEwkcm0ZR6SJEkaBxOZTEuSJEnjIFU16hgGluRq4IebaHc7AT/dRPu6NbC/+mN/9cf+6o/91R/7qz/2V3/sr/6MU3/do6p2ntk40cn0ppTk7KqaGnUck8L+6o/91R/7qz/2V3/sr/7YX/2xv/ozCf1lmYckSZI0IJNpSZIkaUAm071bPeoAJoz91R/7qz/2V3/sr/7YX/2xv/pjf/Vn7PvLmmlJkiRpQI5MS5IkSQMymZ5Hkv2SXJJkbZJVo45n3CTZLcnnk1yc5MIkL2rb75jk00m+1/57h1HHOk6SbJnkm0k+0U7bX10k2THJh5N8p32fPcL+6i7J37SfxQuSnJRkO/trY0lOSHJVkgs62rr2UZJXtH8DLknypNFEPTpd+uvN7WfyW0k+mmTHjnn214z+6pj30iSVZKeONvtrlv5K8oK2Ty5M8s8d7WPXXybTc0iyJXAssD+wF3Bokr1GG9XYuRF4SVXdF3g48Ndt
H60CPltVewCfbaf1ey8CLu6Ytr+6ezvwP1V1H+APaPrN/ppFkl2AFwJTVXV/YEvgEOyvmU4E9pvRNmsftf+fHQLcr13nne3fhs3Jidyyvz4N3L+qHgh8F3gF2F+tE7llf5FkN+AJwI862uyvWforyWOBg4EHVtX9gLe07WPZXybTc9sbWFtVl1bVb4CTaV5ctarqiqo6t31+HU2iswtNP72vXex9wFNGEuAYSrIrcABwXEez/TWLJLcH/hg4HqCqflNVv8D+mstWwG2SbAXcFrgc+2sjVXUmcM2M5m59dDBwclXdUFU/ANbS/G3YbMzWX1X1qaq6sZ38GrBr+9z+mv39BfCvwN8BnSer2V+z99fzgWOq6oZ2mava9rHsL5Ppue0CXNYxvb5t0yySLAP+EPg6cJequgKahBu48whDGzdvo/kP9aaONvtrdvcErgbe25bFHJdke+yvWVXVj2lGcH4EXAFsqKpPYX/1olsf+Xdgfn8BfLJ9bn/NIslBwI+r6vwZs+yv2d0b+KMkX0/yxSQPbdvHsr9MpueWWdq8/MksktwO+L/Ai6vq2lHHM66SHAhcVVXnjDqWCbEV8GDgXVX1h8D1WKLQVVvnezCwO3B3YPskfzbaqCaefwfmkOTvacr9PjjdNMtim3V/Jbkt8PfAq2ebPUvbZt1fra2AO9CUj74MOCVJGNP+Mpme23pgt47pXWl+MlWHJFvTJNIfrKqPtM1XJrlbO/9uwFXd1t/MPAo4KMk6mrKhxyX5APZXN+uB9VX19Xb6wzTJtf01u32BH1TV1VX1W+AjwCOxv3rRrY/8O9BFkucABwLPqt9fZ9f+uqV70XzBPb/9v39X4Nwkd8X+6mY98JFqfIPml9ydGNP+Mpme21nAHkl2T7INTdH7qSOOaay03xSPBy6uqn/pmHUq8Jz2+XOAj2/q2MZRVb2iqnatqmU076fPVdWfYX/Nqqp+AlyWZM+26fHARdhf3fwIeHiS27afzcfTnMdgf82vWx+dChySZNskuwN7AN8YQXxjJcl+wMuBg6rqVx2z7K8ZqurbVXXnqlrW/t+/Hnhw+/+b/TW7jwGPA0hyb2Ab4KeMaX9tNeoAxllV3ZjkKOAMmrPiT6iqC0cc1rh5FPBs4NtJzmvbXgkcQ/OzzOE0f+CfMZrwJob91d0LgA+2X2gvBQ6jGQiwv2aoqq8n+TBwLs1P79+kuXvY7bC/bpbkJGAfYKck64HX0OUzWFUXJjmF5kvcjcBfV9XvRhL4iHTpr1cA2wKfbr638bWqOtL+mr2/qur42Za1v7q+v04ATmgvl/cb4Dntrx9j2V/eAVGSJEkakGUekiRJ0oBMpiVJkqQBmUxLkiRJAzKZliRJkgZkMi1JkiQNyGRa0lhLUkne2jH90iRHL9K2T0zyJ4uxrXn284wkFyf5/LD3NYnaa8Z+Jsl5SZ65iff9lCR7dUy/Lsm+mzIGSZPNZFrSuLsBeFqSnUYdSKckW/ax+OHAX1XVY4cVz3z6jHdT+0Ng66p6UFX91ybe91OAm5Ppqnp1VX1mE8cgaYKZTEsadzfS3Hjkb2bOmDmynOSX7b/7JPliklOSfDfJMUmeleQbSb6d5F4dm9k3yZfa5Q5s198yyZuTnJXkW0n+smO7n0/yIeDbs8RzaLv9C5K8qW17NfBo4N1J3jxj+bT7uaBd75kd8/6ubTs/yTFt2/J2BPf8JOcmuVcb0yc61ntHkue2z9cleXWS/wWekeSJSb7arvvfSW7Xsdxr2/ZvJ7lP2367JO9t276V5Olte7ftHJPkonbZt8zSP3dM8rF2/teSPDDJnYEPAA9qR6bvNWOdh7TH+9Xpvmrbn5vkHR3LfSLJPv3El+SRwEHAm6f33fmeSvL4JN9sj/+EJNvO1V+SNk8m05ImwbHAs5Is6WOdPwBeBDyA5i6d966qvYHjaO6qOG0Z8BjgAJqEdzuakeQNVfVQ4KHAEWluXQuwN/D3VbVXxzZIcnfgTTS3wH0Q
8NAkT6mq1wFnA8+qqpfNiPFp7bJ/AOxLk9TdLcn+NCOmD6uqPwD+uV3+g8CxbdsjgSt66IdfV9Wjgc8ArwL2raoHtzH9bcdyP23b3wW8tG37h7YfHlBVDwQ+l+YXgltsJ8kdgacC92uXfcMssbwW+GY7/5XAf1bVVcDzgC+1I9Pfn7HOe4EXVtUjejhW+omvqr5Cc3vil83cd/s+OBF4ZlU9gOaOwc+fp78kbYZMpiWNvaq6FvhP4IV9rHZWVV1RVTcA3wc+1bZ/myaBnnZKVd1UVd+juV35fYAnAn+e5Dzg68CdgD3a5b9RVT+YZX8PBb5QVVdX1Y00ie8fzxPjo4GTqup3VXUl8MV2O/sC762qXwFU1TVJdgB2qaqPtm2/np4/j+myiYfTlDN8uT2u5wD36FjuI+2/5/D7/tmX5osM7T5/Psd2rgV+DRyX5GnAbLE9Gnh/u63PAXea6wtSO2/Hqvpi2/T+eY92YfF12hP4QVV9t51+Hxu/nrP1l6TN0FajDkCSevQ24FyakcppN9IOCiQJsE3HvBs6nt/UMX0TG//fVzP2U0CAF1TVGZ0z2jKC67vEl3ni72edzBJXt2Vv7oPWdjPmT8cb4NNVdWiX7Uz3z+/4ff90i2PW7STZG3g8cAhwFM0o/XzHMHP7M5fvNr/bcS8kvvli7TRbf0naDDkyLWkiVNU1wCk0JRjT1gEPaZ8fDGw9wKafkWSLtlb3nsAlwBnA85NsDZDk3km2n2c7Xwcek2SnNCf7HUoz0jyXM4FnpqnR3plm5PMbNKPof5Hktu3+79iOzq9P8pS2bdt2/g+BvdrpJTTJ4my+BjwqyfJ2/dsmufc88X2KJumkXecO3bbT1iUvqarTgRfTlK/MdrzPatfbh6ZU4tpuO6+qXwAbkjy6bXpWx+x1NHXWWyTZjab8putxzhHfdcAOs+z+O8Cy6e3QlArN93pK2gz5bVrSJHkrHckd8B7g40m+AXyW7qPGc7mEJkm6C3BkVf06yXE0P92f2454X01Tw9xVVV2R5BXA52lGNU+vqo/Ps++PAo8AzqcZgf27qvoJ8D9JHgScneQ3wOk0NcbPBv4jyeuA3wLPqKpLk5wCfAv4HvDNLvFdnebExJOmT6SjqS3+7mzLt94AHNue9Pc74LVV9ZEu27mO5rXYrj3+W5wwChwNvDfJt2jKLJ4zV+e0DgNOSPIrmi85074M/ICmbOcCml8t5jrObvGdDLwnyQuBm09mbd8HhwH/nWQr4Czg3T3EK2kzk6q5fmGTJGk8JFkGfKKq7j/qWCRpmmUekiRJ0oAcmZYkSZIG5Mi0JEmSNCCTaUmSJGlAJtOSJEnSgEymJUmSpAGZTEuSJEkDMpmWJEmSBvT/AZa6SPwU6tDEAAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<Figure size 864x360 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "plt.figure(figsize=(12, 5))\n",
    "plt.hist(df_ques.question.value_counts(), bins=50)\n",
    "# 'nonposy' was renamed to 'nonpositive' in Matplotlib 3.3\n",
    "plt.yscale('log', nonpositive='clip')\n",
    "plt.title('Log-Histogram of question appearance counts')\n",
    "plt.xlabel('Number of occurrences of question')\n",
    "plt.ylabel('Number of questions')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.2 Text analysis\n",
    "1. Distribution of the number of characters per question;\n",
    "2. Distribution of the number of words per question;"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T11:53:18.085094Z",
     "iopub.status.busy": "2021-07-10T11:53:18.084773Z",
     "iopub.status.idle": "2021-07-10T11:53:19.455897Z",
     "shell.execute_reply": "2021-07-10T11:53:19.454725Z",
     "shell.execute_reply.started": "2021-07-10T11:53:18.085065Z"
    }
   },
   "outputs": [],
   "source": [
    "train_qs = pd.Series(df_train['question1'].tolist() + df_train['question2'].tolist()).astype(str)\n",
    "test_qs = pd.Series(df_test['question1'].tolist() + df_test['question2'].tolist()).astype(str)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T11:53:19.460092Z",
     "iopub.status.busy": "2021-07-10T11:53:19.459759Z",
     "iopub.status.idle": "2021-07-10T11:53:22.137203Z",
     "shell.execute_reply": "2021-07-10T11:53:22.136145Z",
     "shell.execute_reply.started": "2021-07-10T11:53:19.460059Z"
    }
   },
   "outputs": [],
   "source": [
    "dist_train = train_qs.apply(len)\n",
    "dist_test = test_qs.apply(len)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T11:53:22.142795Z",
     "iopub.status.busy": "2021-07-10T11:53:22.142339Z",
     "iopub.status.idle": "2021-07-10T11:53:24.699907Z",
     "shell.execute_reply": "2021-07-10T11:53:24.698888Z",
     "shell.execute_reply.started": "2021-07-10T11:53:22.142762Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "mean-train 59.82 std-train 31.96 mean-test 60.07 std-test 31.62 max-train 1169.00 max-test 1176.00\n"
     ]
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAA4oAAAJjCAYAAABKse8GAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8QVMy6AAAACXBIWXMAAAsTAAALEwEAmpwYAABFeUlEQVR4nO3de7xtZV0v/s9XRIHUIBBEUCEPlh52oiJqWkknVLQjmWlew+yEsvX8tIsJdVK0Y5Dl9eTGWySWeMkrJiqKkJaabAxF8AIqyRYCRDaCiNfn98cYC+aYe133XmvNdXm/X6/1WnNc53eMOffc87OeZzyjWmsBAACAKbeadAEAAACsLIIiAAAAA4IiAAAAA4IiAAAAA4IiAAAAA4IiAAAAA4IisOiq6oSqalX14WmWvbOqzplAWQtWVQ/tj+PgkXmtqp69DM99cP9cD11IfTOs96aq2ryA535YVT133sWuE1X1gqr6ZlX9pKretJ37mNdrthJU1eOr6mmTrmMSqupPZvu3N7Leqnk9F6qqDquqE6aZf0JVfWsCJQHLTFAEltLDqur+ky5ikT0oyT9NuogF+oskT1vA+g9L8twlqWSVqqpDk7woyd8meXC6c7rWPT4Le9+sJX+S5KHzWO+z6T4Tvrqk1UzGYUleOM38NyZ5+DLXAkzArSddALBmfTvJliR/luQ3FnvnVbVra+17i73fubTWPr3cz7mjWmsr/ktsVe2c5CettR9PupYZ/Hz/+zWtte9MtJJZVNUurbWbJl3HuKqqJLddibXtiP69sOo+E3ZEa21Lus92YI3ToggslZbkL5M8uqo2zLZiVR1SVWdV1Y1VdW1VvaWq9hlZfkDfvevJVfXmqtqa5P0j859QVX9fVd+pqi1V9ZR+uz+pqsur6uqq+ququtXIPn++qt5WVZf1z3thVT13dJ0Zah10Pa2qh1TVJ/rn/k5VnV9Vjxvb5n/1+/9+Vf1nVf3JNPvd2Nfy3ap6f5J9Zz27Q3tV1T9V1Q1V9bWq2ji270HX06ravare2J+bm6rqG1X1hn7ZCUn+KMnd+mNto90s++6IF/THcllVvaSqbj32fA+tqs/3+z6378L2rdFubFV1Tt8N+Ziq+mqSm5LceT6vy0h3v/9RVe/rz9nFfZfZnarqr/vn+2ZV/eFcJ6/f5oT+PHy/f84njZ6/JP/QT15Xc3cJ/oWqen9Vbe1fk89U1RFjq831mj2oqk7vX6Pv9u+rJ4+t87S+lsP68/m9JM/rl53Uv0439P8m3lJVd5qm1t/v17upqq7sX5Of7o/5sUl+ZeR9cMLIdkdV1eZ+u/+qqpdWF/anlp/QvwYPqapz072+jxt//vmes6o6sKreW92/sev7df/byPKpz4JfH9vv+Ht/qq77VNWn+/fYf1TVL42sc2mSPZO8cOTYHzpD3TN1T39OVf1ldZ89V1XVa6rqtjMd/8i2g8+Bqjpi9Pnne5z9vIOr6gP9+bq+f7/daWT5zlX1NyPv+8ur6j1VdZvquhz/v5HjadVfMlDTdD2d6/WZ73mpWT6bgOWnRRFYSv+UrrvenyV5wnQrVNUdk5yT5ItJnpTkdklOSvKRqjq0tfaDkdX/Jsm7033hHG15+qskb0n3xfbpSU6tqvskuVs/fb8k/zfJfyR5W7/Nfkm+3G93fZJD+lp3TXLifA6uqu6Q5J+TvC/Ji5NUkg1Jdh9Z53npAvNL++O8X5K/qKobW2t/269zVJLXJHltkvcm+ZUkp8ynht4bkpya5PVJnpjkNVW1ubX2mRnWf3mSX0zyB0n+K8ldkvxyv+yNSQ5K8qtJHtPPu7qv82FJ3p7kzekCyS+k64K5Z5Jn9uvsl+SMJJ9M8qdJ7pTuHO86TR0PTnL3JM9PcmOS65LcI/N/XV7X/7wmXVfBd/bbVbr30qOSvKyqPjlHS/CL++1flOTcdO+jt1RVa629tT/Gy5L8n/68fC/JRdPtqKp+Psm/9cfw
zCTXJDk03TkeNddrdrd+P69NF7IenOTvq+onfU2j3prk5L7+rf28vdO97y5Pcsd04f9jVbVhqtW2qv5Pf+yb0r2eu/Xn7Hb9Md813Xt5KsRu6bd7fP+cr0v3Gt893WtzqyR/PFLXbv0xvjTJV/paFnzO+iBxVpIfJvn9JD/qj/Vf+uP59nT7ncVUXa9I9/5/YZL3VNVdW2s3pnvfn53u/fTGfptpX+9Z/FGSjyV5Srp/Jycm+c9052Jai/A5MLqv/5bunG5O8tQkO6V7Td9fVYe11lqS45M8OclxSb6e7t/qI/t1P5DkZf1xPKjf7bQt6Qt8feY6L7N9NgHLrbXmx48fP4v6k+SEJN/qHz8tXai7Rz/9ziTnjKx7Urovt3cYmXdYuhbJJ/bTB/TT7xl7nqn5fz8y7w7pvrBcnGSnkfmfSfL2GeqtdH84+9MkXxuZ/9B+/wePzGtJnt0/PrSfvv0M+71DkhuSvHBs/ovTfQnaaaS2D46t84Z+3w+d5TxP1ffikXk7pwt2J43Me1OSzSPTX0jyv2fZ798kuXSa+Z9OcvbYvD/pX9/9++m/TvKtJLuOrPP4vs4TRuadky5w3WmWOuZ6XV44Mu9e/byPjcy7VX+e/2qW5/iZJN+d5jU6I8mXR6af1u//dnO899+aLlDtOsPyeb1mM5yH140d31RNz5mjpp3S/WGkJfnlft7u6cL5y2fZbvBvdaSW/8zIv7l+/tP713PPfvqE/vmOmq22eZ6zZ6YLHz87Mm//JD9Icnw/fUD/fL8+tu34e3+qrl8dmXdIP+8RI/O+Nfp+nce/wfHPiI+PrffeJJ+eY19zfg4s4Dj/IV3wvs3IvIPS/Vt9VD/9z0leNks9z07Sppl/QvrP9/m+PvM9L5njs8mPHz/L+6PrKbDU/jHJN9L99Xo6hyU5s41c99W6VpVLkzxkbN0PzLCPs0a2/U66L93/0obXu12S7styku5arqp6UVVdkuT76cLlS5IcWGNdKWfx1XRB8LTquuLtPrb8QUl+Ksk/VdWtp37S/UV9nyT7V9VOSe6TrlVy1LvnWUOSnDn1oLU2FZL3n2X985M8r+/mdo/5PEFf532z7UA+b08XyKZaHe6f5CNteP3o6TPs9rzW2n+NPc9CXpezRh5f0v/+2NSM1tpPknwtI6/7NA5O18I03XHdo6r2nmXb6fxquj9IzHX97KyvWVXtUVWvrqr/THcOfpjkmHQtruO2+XdRVUdW1Ser6rp0X+Knrimb2v5B6Vpp/35eR3WLe6RraXzHNO/pXdKdz5sPLckH57HPuc7ZYUk+21r72s077q6T+7ds+xkxHz9M94eKKVOthbP9m1moM8emL5pt/4v0OTDq15K8J8lPRl6jr6f7XD20X+f8JE+rrov+L1RVbedzLeT1meu8nJ8FfjYBS0dQBJZUa+1H6boVPaWq7jbNKvsmuXKa+Vema+0ZnzedrWPTP5hh3i4j03+Vrpvc69N1t7p/uu6pGVtvRq21a9ONELpzknckubq/Juhn+1X26n9fmFu+7P8wXbe2pOtWdcd0rUVXje1+fHo2W8emx4913LPT/SX/BUm+XN31fdN2DR6xV7rjHH8NpqanXqs7pe+qOqV1A5jcMM0+p3s9F/K6bB15jh+Mz+vNdS6mrgWd6bj2mGXb6eyZ5Ip5rLd1bHq8zjcl+e10LbQPS3ceTsn0xzKovbqRhk9PFw6fmi4UPrBfPLX9nv3v+dQ6auo9fUaG7+mv9/NHu9he24Zdx2cy1zlbyGfEfHyn/yNCksF7Z17/7udp69j0XO/DxfgcGLVXui7dPxz7+dnc8hr933RdXTcm+VySy6rqOdvxXAt5fbaOTY+fl+35bAKWiGsUgeVwSrrru54/zbIr0l1PNW6fJOeNzWuLWNPjkvy/1trN1wxV1aMWupPW2qeSPKKqdk33V/yXJzkt3RfzqWtzfj3Tf5H6crrufz/KtudgoS1Z89Za25rk/0vy/1XVL6TrPvqW
qvp8a22ma7G+le6L5nhdU4MOTR3rf6X70nuzqtol3XVv25QyzbxFeV0WYCqg7J3u2rgp48c1X9dkYQMRbaM/X49K18X5tSPzZ/rj7vh5fEy6sP7brbXWbzv+R5qpY9033Ws7X1Pn45h01/yO+/rI4/n+e53rnF2R5L9PM3+fkXqmRlO9zdg62xMkJ+HqzO9zYL7H+e10LYpvzLa+ldz8B5wXJHlBVR2UrgvpK6vqy621Dy2g9vm8PvOynZ9NwBLRoggsudba99Nd9/b0bPuF8N+TPLyqbj81o28ROSDJvy5hWbum69o49Zw7ZYYBd+ajtfa91tr704Xie/WzP5Xuuq07t9Y2T/Nzfd899vwkR43t8je3t5YF1v35dAOZ3Cq33AJim9aPvs7zsu3IlY9P8pN0x5p0g8Ec0QfnKY9eQEmL+rrMwxfShfXpjusrrbWrt91kVmcleXwf9rbXbdNdVzh6Hm6f+Z/HXZP8cCok9p48ts7Ue/PoWfYzXSvYl5N8M8kBM7ynr9l2N3Oa65z9e5L7VdWBUzP6QZN+Mbd8RlyV7g8Z9xxZ53a5pUv0Qs3VArioFvA5MN/jPCtdN+DzpnmNLp3m+S9O15L//dzy+fWDfv9znYf5vD4LNsNnE7CMtCgCy2VqhMRfTPIvI/NfnuTYJB+uqr/KLaOeXpDkXUtYz0eSPKu/Fu7bSZ6V7gv6vPUtXU9P11XqG+muhXtG+uvkWmtbq7ulwKv6Fp2Pp/vSc48kh7fWpkYV/csk766qk9O1AvxKkkfsyMHNUfe/9s/zhXStPr+fbkCXqRE3v5Rkn+qGyP9CuoErLk03OuSHq+rv040euyHdSIpv6K9JSpJXpjuX76+qV6TrinpcujB2c3e/Wezw67IQrbVvV9Urk/yfqvpRulEifzNdt9cnbscup0ZO/XhVvSxda9l9klzTWpvXCJatteuqu6XEC6rqO+nO23HpRoW9wzx28ZEkz+2P6/3p/s09Zew5tlbVXyR5SVXdJl1X0tuma8l8UWvtm+neB0dV1W+k68Z6eWvt8qr6oyT/0I/6+8F0geJn090v9bdaN3LoQsx1zt6UrjfCB6vqBekGZDkhXcvY6/rj+UlVvS/JH/TXdW5NN8Lm9t5r9UtJHlVVH0rXbfrLrbXrt3Nf8zXn58ACjvOEdP+eP1BVp6Q7V/slOSLJm1pr51TVe9L98ec/+u1/K933wo/3+/hS//s5VfWxdF12vzxN3W/KHK/PfM3jswlYRloUgWXRf3l8xTTzr05yeLouVW9Nd83MJ5IcMc/rm7bX/+6f5zXpWgG/kHneFmPEJbnlfpFnprsW80PpwmOSpO9CeUySI9MNVPHWdK07nxhZ5z19Pf8zXei8T5LfW/ghzdun0o2Y+c5011buleTIkbD3jnRf/l6a7gv8CX2dZ6Zr3Ts0XQB5broh9G++r2QfMB6Vrsvcu9Md19PTtZDN50b1i/G6LNQL+uc4Nt1IkL+c5CmttbfNutU0+i/SD0n3JfmN6b70/la6kUIX4knpunG+Ocmr0v3R5M3zrOGMdF/cH5vuWsVfSdf9eXy9qWP+tXTvzdelGw11KhBtSve+PiXd++CYfru3p2v5OiTdIEDvTned22fTt0ItxFznrO+R8Gvpgsvfpbu1xX+mGwl0tGvjs9MNoLIp3fvnrRkZ3GiBnpcuoHwg3bHfbzv3M28L+ByY8zhba19J1/39xnTX+34wXSD/fm4Z+OmT6cL9aele//sleWxrbep+jJ9Id43sc9K1Gk4b+hbw+szHXJ9NwDKqYc8UAFhcVfWQdF86f7W1dvZc6wOdqjo4Xe+Kw1tr50y4HGCd0fUUgEXVdyH+j3QD2/xckj9P8vkMuxwDACuYoAjAYrttui5r+6Trxnhmkj8cvSUBALCy6XoKAADAgMFsAAAAGFi3XU/32muvdsABB0y6DAAAgIk477zzvtVau+N0y9ZtUDzggAOyefPmuVcEAABYg/p7sk5L11MAAAAG
BEUAAAAGBEUAAAAG1u01igAAwPr2wx/+MFu2bMlNN9006VKW1C677JL9998/O++887y3ERQBAIB1acuWLbn97W+fAw44IFU16XKWRGst11xzTbZs2ZIDDzxw3tvpegoAAKxLN910U/bcc881GxKTpKqy5557LrjVVFAEAADWrbUcEqdszzEKigAAAAy4RhEAACDJhlM3LOr+Ljj6glmXb926Naeddlo2bty4oP0+8pGPzGmnnZbdd999B6qbnRZFAACACdi6dWs2bdq0zfwf//jHs253xhlnLGlITLQoAgAATMRxxx2Xr371qznkkEOy884753a3u1323XffnH/++bnooovyG7/xG7nsssty00035TnPeU6OOeaYJMkBBxyQzZs354YbbsiRRx6ZhzzkIfnkJz+Z/fbbL+973/uy66677nBtWhQBAAAm4KSTTsrd7373nH/++fnrv/7rfOYzn8lLXvKSXHTRRUmSU045Jeedd142b96cV7/61bnmmmu22cfFF1+cZz3rWbnwwguz++67513vetei1KZFEQAAYAU47LDDBvc6fPWrX533vOc9SZLLLrssF198cfbcc8/BNgceeGAOOeSQJMn97ne/XHrppYtSi6AIAACwAvzUT/3UzY/POeecfPSjH82nPvWp7LbbbnnoQx867b0Qb3vb2978eKeddsr3vve9RalF11MAAIAJuP3tb5/rr79+2mXXXXdd9thjj+y222750pe+lE9/+tPLWpsWRQAAgMx9O4vFtueee+bBD35wDj744Oy6667ZZ599bl72iEc8Iq997WvzC7/wC/m5n/u5PPCBD1zW2qq1tqxPuFIceuihbfPmzZMuAwAAmJAvfvGLuec97znpMpbFdMdaVee11g6dbn1dTwEAABgQFAEAABgQFAEAABgQFAEAABgQFAEAABgQFAEAABhwH0UAAIAkOfvExd3f4cfPunjr1q057bTTsnHjxgXv+pWvfGWOOeaY7Lbbbttb3ay0KALzsuHUDYMfAAB2zNatW7Np06bt2vaVr3xlbrzxxkWu6BZaFAEAACbguOOOy1e/+tUccsghOeKII7L33nvnHe94R77//e/nMY95TF70ohflu9/9bh7/+Mdny5Yt+fGPf5w///M/z5VXXpnLL788hx9+ePbaa6+cffbZi16boAgAADABJ510Ur7whS/k/PPPz5lnnpl3vvOd+cxnPpPWWh796Efn4x//eK6++urc+c53zgc+8IEkyXXXXZef/umfzstf/vKcffbZ2WuvvZakNl1PAQAAJuzMM8/MmWeemfvc5z65733vmy996Uu5+OKLs2HDhnz0ox/N85///HziE5/IT//0Ty9LPVoUAQAAJqy1luOPPz7PeMYztll23nnn5Ywzzsjxxx+fhz3sYXnBC16w5PVoUQQAAJiA29/+9rn++uuTJA9/+MNzyimn5IYbbkiSfPOb38xVV12Vyy+/PLvttlue8pSn5I//+I/z2c9+dpttl4IWRQAAgGTO21kstj333DMPfvCDc/DBB+fII4/Mk570pDzoQQ9KktzudrfLP/7jP+aSSy7J8573vNzqVrfKzjvvnJNPPjlJcswxx+TII4/MvvvuuySD2VRrbdF3uhoceuihbfPmzZMuA1aN8VtiXHD0BROqBABgcXzxi1/MPe95z0mXsSymO9aqOq+1duh06+t6CgAAwICgCAAAwICgCAAArFvr4VK87TlGg9kAMxq/LhEAYC3ZZZddcs0112TPPfdMVU26nCXRWss111yTXXbZZUHbCYoAAMC6tP/++2fLli25+uqrJ13Kktpll12y//77L2gbQREAAFiXdt555xx44IGTLmNFco0iAAAAA4IiAAAAA4IiAAAAA4IiAAAAA4IiAAAAA4IiAAAAA26PAdxsw6kbJl0CAAArgBZFAAAABgRFAAAABgRFAAAABlyjCGyX8esZLzj6gglVAgDAYtOiCAAAwICgCAAAwICgCAAAwICg
CAAAwICgCAAAwICgCAAAwICgCAAAwICgCAAAwICgCAAAwICgCAAAwICgCAAAwICgCAAAwMCtJ10AsHw2nLphMH3B0RdMqBIAAFYyLYoAAAAMaFEEFoXWSgCAtUOLIgAAAANaFGGNGW3Zm6tVb7wVEAAAEi2KAAAAjBEUAQAAGBAUAQAAGBAUAQAAGBAUAQAAGBAUAQAAGBAUAQAAGHAfRWBZLOT+jgAATJagCGvYaDgDAID5Wvaup1X1iKr6clVdUlXHTbO8qurV/fLPV9V9+/l3qaqzq+qLVXVhVT1nZJufqaqPVNXF/e89lvOYAAAA1pJlDYpVtVOS1yQ5Msm9kjyxqu41ttqRSQ7qf45JcnI//0dJ/qi1ds8kD0zyrJFtj0tyVmvtoCRn9dMAAABsh+VuUTwsySWtta+11n6Q5G1Jjhpb56gkb26dTyfZvar2ba1d0Vr7bJK01q5P8sUk+41sc2r/+NQkv7HExwEAALBmLXdQ3C/JZSPTW3JL2Jv3OlV1QJL7JPn3ftY+rbUrkqT/vfd0T15Vx1TV5qrafPXVV2/vMQAAAKxpyz2YTU0zry1knaq6XZJ3JXlua+07C3ny1trrk7w+SQ499NDx5wUWkYF0AABWr+VuUdyS5C4j0/snuXy+61TVzulC4ltaa+8eWefKqtq3X2ffJFctct0AAADrxnIHxXOTHFRVB1bVbZI8IcnpY+ucnuR3+tFPH5jkutbaFVVVSf4uyRdbay+fZpuj+8dHJ3nf0h0CAADA2rasXU9baz+qqmcn+XCSnZKc0lq7sKqe2S9/bZIzkjwyySVJbkzyu/3mD07y1CQXVNX5/bw/ba2dkeSkJO+oqt9L8o0kj1umQ4KJ08UTAIDFttzXKKYPdmeMzXvtyOOW5FnTbPevmf76xbTWrknyPxa3UgAAgPVp2YMisDYde+3WwfTJe+w+kToAANhxgiIwL4IgAMD6sdyD2QAAALDCaVEEloQWSACA1UtQBLbLeBAEAGDt0PUUAACAAUERAACAAUERAACAAdcoAjNazOsQR/e14dQNg2UXHH3Boj0PAAA7TosiAAAAA4IiAAAAA7qeAsvOrTUAAFY2LYoAAAAMCIoAAAAMCIoAAAAMCIoAAAAMGMwGuJlBZgAASLQoAgAAMEZQBAAAYEDXU2DiNpy6YTB9wdEXTKgSAAASLYoAAACM0aIITJxBdAAAVhZBEdYxAQ0AgOnoegoAAMCAoAgAAMCAoAgAAMCAoAgAAMCAwWyAlefsE4fThx8/mToAANYpLYoAAAAMCIoAAAAMCIoAAAAMuEYR1pFjr9066RIAAFgFtCgCAAAwoEURWHE2fW7TYHqjUU8BAJaVFkUAAAAGBEUAAAAGdD0FVr6zTxxO64oKALCkBEVgxXPNIgDA8tL1FAAAgAFBEQAAgAFdT2GV2XDqhkmXAADAGicowhp27LVbJ10CAACrkKAIK5wWRAAAlptrFAEAABgQFAEAABgQFAEAABgQFAEAABgQFAEAABgQFAEAABgQFAEAABgQFAEAABgQFAEAABgQFAEAABi49aQLAFiws0+85fHhx0+uDgCANUpQhDXm2Gu3TroEAABWOV1PAQAAGBAUAQAAGBAUAQAAGHCNIrC6jQ5skxjcBgBgEWhRBAAAYEBQBAAAYEBQBAAAYMA1isCqs+lzm25+vPHeGydYCQDA2qRFEQAAgAFBEQAAgAFBEQAAgAHXKMIqd+y1WyddAgAAa4wWRQAAAAYERQAAAAYERQAAAAYERQAAAAYERQAAAAYERQAAAAYERQAAAAbcRxFWgA2nbph0CavWps9tGkxvPPz4CVUCALB2aFEEAABgQFAEAABgQNdTYG05+8ThtK6oAAALpkURAACAAUERAACAAUERAACAAUERAACAAYPZwCpz7LVbJ10CAABrnBZFAAAABgRFAAAABgRFAAAABgRFAAAABgxmA6xtZ584nD78+MnUAQCwimhRBAAAYEBQBAAAYEBQ
BAAAYEBQBAAAYEBQBAAAYEBQBAAAYEBQBAAAYEBQBAAAYODWky4A1qMNp26YdAkAADAjQRFWuGOv3TrpEtaWs08cTh9+/GTqAABYwXQ9BQAAYEBQBAAAYEBQBAAAYEBQBAAAYEBQBAAAYEBQBAAAYEBQBAAAYEBQBAAAYEBQBAAAYGDZg2JVPaKqvlxVl1TVcdMsr6p6db/881V135Flp1TVVVX1hbFtTqiqb1bV+f3PI5fjWAAAANaiZQ2KVbVTktckOTLJvZI8saruNbbakUkO6n+OSXLyyLI3JXnEDLt/RWvtkP7njEUtHAAAYB1Z7hbFw5Jc0lr7WmvtB0neluSosXWOSvLm1vl0kt2rat8kaa19PMm3l7ViAACAdWa5g+J+SS4bmd7Sz1voOtN5dt9V9ZSq2mO6FarqmKraXFWbr7766oXUDQAAsG4sd1Csaea17Vhn3MlJ7p7kkCRXJHnZdCu11l7fWju0tXboHe94xzl2CawLZ594yw8AAEmWPyhuSXKXken9k1y+HesMtNaubK39uLX2kyRvSNfFFQAAgO2w3EHx3CQHVdWBVXWbJE9IcvrYOqcn+Z1+9NMHJrmutXbFbDuduoax95gkX5hpXQAAAGZ36+V8stbaj6rq2Uk+nGSnJKe01i6sqmf2y1+b5Iwkj0xySZIbk/zu1PZV9dYkD02yV1VtSfLC1trfJXlpVR2SrovqpUmesVzHBAAAsNYsa1BMkv7WFWeMzXvtyOOW5FkzbPvEGeY/dTFrBAAAWM+WPSgCszv22q2TLgEAgHVuua9RBAAAYIXTogisKZs+t2kwvfHeGydUCQDA6qVFEQAAgAFBEQAAgAFBEQAAgAFBEQAAgAGD2QBrmsFtAAAWTlAE1pVZg+PZJw5XPvz4ZagIAGDl0fUUAACAAS2KsAw2nLph0iUAAMC8aVEEAABgQFAEAABgQFAEAABgwDWKsAIce+3WSZcAAAA306IIAADAgKAIAADAgKAIAADAgKAIAADAgKAIAADAgFFPAWZy9onD6cOPn0wdAADLTIsiAAAAA4IiAAAAA4IiAAAAA4IiAAAAA4IiAAAAA4IiAAAAA26PAaxrmz636ebHG++9cYKVAACsHFoUAQAAGNCiCEtgw6kbJl0CAABsNy2KAAAADGhRhAk49tqtky4BAABmpEURAACAAS2KAPN19onD6cOPn0wdAABLbEEtilX161WlFRIAAGANW2joe1+Sb1bVX1XVPZeiIAAAACZroUHx7klen+TxSb5QVZ+qqt+vqjssfmkAAABMwoKCYmvt0tbaC1trByY5IsklSV6R5Iqq+oeqOnwpigQAAGD5bPf1hq21j7XWnprkHknOS/LkJB+tqq9X1R9UlYFyAAAAVqHtDnNV9StJfjfJY5P8MMlrkrw3ycOTvCjJ/ZM8acdLBFgemz63aTC98d4bJ1QJAMBkLSgoVtXdkhzd/xyQ5JwkxyR5d2vt+/1qZ1XVp5L84+KVCQAAwHJZaIvi15JcnuRNSU5prX19hvUuTPKZHagLYOVzX0UAYI1aaFD8n0k+1Fr7yWwrtda+ksTANsCqpisqALBeLXQwm99KcrfpFlTV3arqlB0vCQAAgElaaFA8OskdZ1i2V78cAACAVWyhQbGStBmWHZzk6h0rBwAAgEmb8xrFqnpOkuf0ky3Je6vq+2Or7ZJkn3SD3AAAALCKzWcwm4uSvCtda+IfJjk7yRVj6/wgyZeSvGNRqwMAAGDZzRkUW2sfSfKRJKmq65O8sbX2zaUuDAAAgMlY0O0xWmsvWqpCAFa90fsquqciALCKzecaxXckOb619tX+8Wxaa+23F6c0AAAAJmE+LYp3TLJz/3jvzDzqKQAAAGvAfK5RPHzk8UOXtBoAAAAmbqH3UQQAAGCNm881ihsXssPW2qbtLwfWpmOv3TrpEgAAYN7mc43i3y5gfy2JoAgAALCKzecaRd1TAQAA1pEF3UcRYD3b9Llhh4mN915Qz3wA
gFVjPtco3ivJV1tr3+8fz6q1dtGiVAYAAMBEzKdF8QtJHpjkM/3jme6jWP2ynRanNAAAACZhPkHx8CQXjTwGAABgDZvPYDb/Mt1j4BYbTt0w6RIAAGDRbNdgNlX1c0nun2TfJFck2dxa+9JiFgYAAMBkLCgoVtUdkrwhyWOT3CrJDUlul+QnVfXuJP+rtfadRa8SAACAZbPQFsVNSR6W5HeSvLu1dlNV7ZIuOP5tv/wpi1siwOowevuMjYcfP8FKAAB2zEKD4lFJ/qC1dtrUjNbaTUneUlW7JXn5YhYHAADA8ltoULwh3TWJ07k8yXd3rByA1WO0BREAYC251QLXf02SP66qXUdn9q2Jf5yu6ykAAACr2JwtilX10rFZByW5rKo+kuSqJHsnOSLJ95JsXvQKAQAAWFbz6Xr6uLHpH/Y/DxyZd33/+7FJnrcIdQEAADAhcwbF1tqBy1EIwJpy9onDaaOgAgCryEKvUQQAAGCNW+iop6mqSvLgJPdIssv48taaAW1Y9469duukSwAAgO22oKBYVfskOSvJvZK0JNUvaiOrCYoAAACr2EK7nr4syXVJ7pIuJD4gyQFJ/jzJxelaGQEAAFjFFtr19FeSPCfJFf10tda+keQvq+pW6VoTH76I9QEAALDMFtqiuHuSq1trP0nynXT3UJzyySS/uEh1AQAAMCELDYpfT7Jv//jCJE8eWfY/k3x7MYoCAABgchba9fQDSR6W5B1J/m+S91XVliQ/THLXJM9f3PJg5dpw6oZJlwAAAEtiQUGxtXb8yOMPVtWDkzwm3W0yPtJa++Ai1wcAAMAyW/B9FEe11s5Ncu4i1QKwdp194nD68OOnXw8AYAXYrqBYVQ9Lcli66xWvSPLvrbWPLGZhAAAATMaCgmJV3TnJe5LcP8lV/c/eSV5cVZuTPKa19s1FrxIAAIBls9BRT1+frhXxIa21O7XWfqG1dqckv5TkTklet9gFAgAAsLwWGhR/NcmftNY+OTqztfZvSY5LcvhiFQYAAMBkLDQoXpnkezMs+16Sb+1YOQAAAEzaQoPiX6a7HnH/0Zn99AuTvGSxCgMAAGAy5hzMpqreMTZrzyRfrarP5pbBbO7bP/61dNcxAgAAsErNZ9TTO45NX9z/JMkdktyUZOqaxb0WqS6Atc19FQGAFWzOoNhaM0ANAADAOrLQaxQHqmrnxSoEAACAlWE+XU8HquoXk/x5kock2a2qbkzyiSR/0Vr71CLXB6vCsddunXQJAACwaBYUFKvqiCQfSPLlJH+d7nYZ+yT5rSTnVNWjWmsfXfQqAQAAWDYLbVF8SZLTkzyutdZG5r+4qt6V7vYZgiIAAMAqttBrFDckecNYSJzy+n45AAAAq9hCg+LWJHefYdl/65cDAACwii00KP5TkhOr6ilVtUuSVNUuVfWUdN1S37HYBQIAALC8FnqN4vOT7Jnk1CSnVtUNSW7XL3trvxwAAIBVbEFBsbX2vSRPrqq/SHL/JPsmuSLJua21Ly1BfQDrw9knDqcPP34ydQAAZAFBse9qel2S326tvTeJYAgAALAGzfsaxdbaTUmuSvKjpSsHAACASVvoYDavS/L/VdXOS1EMAAAAk7fQwWx2T3Jwkkur6qwkVyYZvadia60Z0AYAAGAVW2hQfGyS7/ePf2ma5S1GPgUAAFjV5hUUq2rXJI9M8rdJ/ivJR1trVy5lYQAAAEzGnEGxqn42yUeTHDAy+7qq+u3W2plLVRgAAACTMZ/BbF6a5CfpupruluS/Jzk/3cA2AAAArDHz6Xr6oCR/1Fr7t376i1X1jP73vq21KxbyhFX1iCSvSrJTkje21k4aW1798kcmuTHJ01prn+2XnZLk15Nc1Vo7eGSbn0ny9nStnpcmeXxr7dqF1AWwopx94nD68OMnUwcAsC7Np0Vx3yRfG5v31SSV5E4LebKq2inJa5IcmeReSZ5YVfcaW+3IJAf1P8ckOXlk2ZuSPGKaXR+X5KzW2kFJzuqn
AQAA2A7zHfW0zb3KvByW5JLW2teSpKreluSoJBeNrHNUkje31lqST1fV7lMtl621j1fVAdPs96gkD+0fn5rknBh9FZigTZ/bNJjeeO+NE6oEAGDh5hsUP1xVP5pm/lnj81tre8+yn/2SXDYyvSXJA+axzn5JZuvius9UF9jW2hVVNW0NVXVMulbK3PWud51ldzC3Y6/dOukSAABgScwnKL5oEZ+vppk33lo5n3W2S2vt9UlenySHHnroYrWSsk5sOHXDYPrYCdUBAABLbc6g2FpbzKC4JcldRqb3T3L5dqwz7sqp7qlVtW+Sq3a4UoBFpCsqALCazGcwm8V0bpKDqurAqrpNkickOX1sndOT/E51HpjkunmMrHp6kqP7x0cned9iFg0AALCeLGtQbK39KMmzk3w4yReTvKO1dmFVPbOqntmvdka6UVYvSfKGJDf/2b2q3prkU0l+rqq2VNXv9YtOSnJEVV2c5Ih+GgAAgO0w38FsFk1r7Yx0YXB03mtHHrckz5ph2yfOMP+aJP9jEcsEAABYt5Y9KAKwLdcwAgAryXJfowgAAMAKJygCAAAwoOspwGpw9om3PD78+MnVAQCsC4IiwASMX5MIALCS6HoKAADAgKAIAADAgKAIAADAgKAIAADAgKAIAADAgFFPAVab0VtlJG6XAQAsOi2KAAAADGhRhBlsOHXDYPrYa7dOphAAAFhmWhQBAAAY0KIIsAJt+tymwfTGe2+cUCUAwHqkRREAAIABQREAAIABQREAAIABQREAAIABg9kArHZnnzicPvz4ydQBAKwZWhQBAAAYEBQBAAAYEBQBAAAYcI0iwCqz6XObBtMb771xQpUAAGuVFkUAAAAGBEUAAAAGBEUAAAAGBEUAAAAGDGYDsAqMD2ADALCUBEWAVW6bUVAPP35ClQAAa4WupwAAAAxoUQRYa84+cTithREAWCBBEWZw7LVbJ10CAABMhK6nAAAADAiKAAAADAiKAAAADAiKAAAADBjMBmCtMwoqALBAWhQBAAAYEBQBAAAYEBQBAAAYEBQBAAAYEBQBAAAYEBQBAAAYcHsMgDVu0+c2DaY3uj0GADAHQRF6G07dMJg+dkJ1wJIbva+i0AgATEPXUwAAAAYERQAAAAYERQAAAAZcowiwxowPXgMAsFCCIvSOvXbrpEuA5Tc6sE1icBsAIImupwAAAIwRFAEAABgQFAEAABgQFAEAABgwmA3AOjM6KurGe2+cYCUAwEolKAJwC6OgAgDR9RQAAIAxgiIAAAADgiIAAAADgiIAAAADgiIAAAADgiIAAAADbo/BurXh1A2D6WMnVAcAAKw0WhQBAAAY0KIIwMzOPnE4ffjxk6kDAFhWgiLr1rHXbp10CQAAsCIJigDMaNPnNg2mN2pRBIB1QVAEYP50RQWAdUFQBFjHtmkxvPfGCVUCAKwkgiIANxsPjnMt1xUVANYmt8cAAABgQIsiANttw6kbbn58wdEXTLASAGAxaVEEAABgQIsiANvN/UgBYG3SoggAAMCAoAgAAMCAoAgAAMCAoAgAAMCAoAgAAMCAoAgAAMCAoAgAAMCA+yiyvpx94qQrgLVr/N/X4cdPpg4AYIdpUQQAAGBAUAQAAGBAUAQAAGDANYoALA3XLALAqqVFEQAAgAEtiqwrmz63adIlAADAiqdFEQAAgAFBEQAAgAFdTwFYFONduzfee+NwBYPbAMCqISgCsCy2CZKCIgCsWIIiAEtirsGjNpy6YTB9wdEXLGU5AMACuEYRAACAAS2KAKwIoy2MWhcBYLIERQAm4thrtw6mT95j94nUAQBsS9dTAAAABgRFAAAABnQ9ZW0bv28bsGKNd0UFACZHUARgxXHrDACYLF1PAQAAGBAUAQAAGND1FIAVx/WKADBZWhQBAAAYEBQBAAAYEBQBAAAYWPagWFWPqKovV9UlVXXcNMurql7dL/98Vd13rm2r6oSq+mZVnd//PHK5jgcAAGCtWdbBbKpqpySv
SXJEki1Jzq2q01trF42sdmSSg/qfByQ5OckD5rHtK1prf7NMhwLAMnJfRQBYXss96ulhSS5prX0tSarqbUmOSjIaFI9K8ubWWkvy6aravar2TXLAPLYFYA0yCioALK/l7nq6X5LLRqa39PPms85c2z6776p6SlXtMd2TV9UxVbW5qjZfffXV23sMAAAAa9pytyjWNPPaPNeZbduTk/xFP/0XSV6W5OnbrNza65O8PkkOPfTQ8edlLTj7xMHkps9tmlAhwFLa9Mq7DKY3PveyGdYEALbHcgfFLUlG/3ffP8nl81znNjNt21q7cmpmVb0hyT8vXskAAADry3J3PT03yUFVdWBV3SbJE5KcPrbO6Ul+px/99IFJrmutXTHbtv01jFMek+QLS30gAAAAa9Wytii21n5UVc9O8uEkOyU5pbV2YVU9s1/+2iRnJHlkkkuS3Jjkd2fbtt/1S6vqkHRdTy9N8oxlOygAAIA1Zrm7nqa1dka6MDg677Ujj1uSZ813237+Uxe5TABWsdHbabiVBgAs3LIHRQBYdGMDWQEAO2a5r1EEAABghdOiCMCqt82tcPbY/eaHo91QE11RAWA+BEUA1pxjr9168+OTR0IjADA/up4CAAAwICgCAAAwoOspAGvaaDdUAGB+BEXWlG0GtAAAABZM11MAAAAGtCiyurnJNrBAm155l8H0xudeNqFKAGDl0qIIAADAgKAIAADAgKAIAADAgGsUAVjfxq91Pvz4ydQBACuIFkUAAAAGtCgCsK6N3391oxZFABAUAWBAV1QA0PUUAACAIS2KADBCV1QAEBQBYHajXVGFRgDWCUERAGYx2sKodRGA9UJQZHUZH2QCAABYdIIiAMyXEVEBWCeMegoAAMCAFkUAmKdtRkQdX0ELIwBrhBZFAAAABrQoAsB2cs9FANYqQREAFovBbgBYI3Q9BQAAYECLIqvaeLcvAABgx2lRBAAAYECLIivb+PU+AKuJaxYBWKW0KAIAADAgKAIAADCg6ymrisFrgJVsm/sq3nvjhCoBgB0jKALAchm9ZtH1igCsYLqeAgAAMKBFEQAmwYioAKxgWhQBAAAYEBQBAAAY0PUUAJbIgkZB1RUVgBVEiyIAAAADWhQBYCXSwgjABAmKrDzjX44A1ojRrqizdkMFgAnT9RQAAIABLYoAMAELGugm0RUVgGUlKALACrTgIAkAi0hQBIBVaMOpG25+fMHRF0ywEgDWIkERAFaA8RbEuZYfO/J4NDQm2wbHuZYDwDhBEQBWuWOv3TrpEgBYY4x6CgAAwIAWRSbPfRMBFpfPVQB2kBZFAAAABrQoAsAas83AOHvsPpE6AFi9BEUAWOMMdgPAQgmKrGhzDRcPAKwu7gEKq4OgCADrzfhgN4cfP5k6AFixDGYDAADAgBZFAFjvRloYN3zjtMEiXQMB1idBEQC4mYFvAEgERSbBjaABJmp8oLCN994488quZwRYlwRFAACWzOgop8DqISgCwDo3262Itml9HF9hhbQwjocR11ZOjmAIa4OgCABsP11TAdYkQZGl55pEgDVrmxbHsaDo5uoAq5OgCADM22zdVJNtux3uyCiqupOuDrqawtokKAIAi2Ypb6+hdRJg+QiKAMCqoxULYGkJigDA8jDwDcCqISiy+AxeA8B8zPH/xWzdWE/eY/dZt3V9I8COERQBgGUx10A4rA66/cL6ICiy4vgiAcBc5ho0Z7zFUQsjwMIIigDAmjMeJOfqqgrAkKAIAKx5S3nbDoC1SFAEANaf0YF0jL4KsA1BEQBYd0avhz/5G6cNlrl+EUBQZLG4JQYAq9Q23VKX8H6PSzmojgF7gMUkKAIAzGYJg+Ns5roNxVIGQbfAAARFAIAR47dp2njvjcMVZutFs8AQuVJaAXckGM41wqyBhGB1EhQBABbLHJdizBWaFjOwLWRfixnmBENYGwRFJm78L7cAsJLM9f/UNi2OO2BBIWsHQynAbARFAIAdMBokx0PjUv4x
1B9agaUkKAIALBLhDVgrbjXpAgAAAFhZBEUAAAAGBEUAAAAGXKPI9CZ0c2EAYB3xfQNWLEGR7TPHkNwAAAsmOMKKoespAAAAA1oUmZ9FbEE0dDgAAKxsgiIdXUkBgJVGV1SYGF1PAQAAGNCiCADA6qCFEZaNFkUAAAAGtCiuJXNdZzihv7oZvAYAWBJaGGHJaFEEAABgQFAEAABgQNdTAADWhtGuqLqhwg7RoggAAMCAFsX1bK7Bb7aTwWsAgPkY/86w8d4bJ1QJME5QBABg7TEiKuwQQREAgLVPcIQFERRZFLqbAgDA2iEoAgCwIiz0D887dE2jFkaYlaC4nizR4DUAAJOwI4PhbLPtbCsLkaxDguJqJvgBANxsthbJhbY+Dva10JbO5142mN5w6oabH19w9AUL2her0+hrnqzO111QBABgzZurW+uijrcw9sf8Y6/desvzvPIug2VzBdgdCb8bvnHarMtHrcYgs5KNvuarVbXWlvcJqx6R5FVJdkryxtbaSWPLq1/+yCQ3Jnlaa+2zs21bVT+T5O1JDkhyaZLHt9auna2OQw89tG3evHnxDmwSVlCLosFsAACYtPHwPPod9eQ9dh8su+CuT5px3emMbr/QIDjeyrxSVNV5rbVDp122nEGxqnZK8pUkRyTZkuTcJE9srV00ss4jk/zvdEHxAUle1Vp7wGzbVtVLk3y7tXZSVR2XZI/W2vNnq0VQ3DGCIQAAzM9qDIrL3fX0sCSXtNa+liRV9bYkRyW5aGSdo5K8uXUJ9tNVtXtV7ZuutXCmbY9K8tB++1OTnJNk1qC4WmzTv3nsLx+j5gpvs/2FZUe6PQAAAGvLcgfF/ZKMxukt6VoN51pnvzm23ae1dkWStNauqKq9p3vyqjomyTH95A1V9eXtOYgltleSb820sPKn273jZ82y7WzL1pFZzz1LyrmfHOd+cpz7yXL+J8e5nxznfkKe9Qe1Us/93WZasNxBsaaZN973daZ15rPtrFprr0/y+oVss9yqavNMzb8sLed+cpz7yXHuJ8e5nyznf3Kc+8lx7idnNZ77Wy3z821JMjrU0/5JLp/nOrNte2XfPTX976sWsWYAAIB1ZbmD4rlJDqqqA6vqNkmekOT0sXVOT/I71Xlgkuv6bqWzbXt6kqP7x0cned9SHwgAAMBataxdT1trP6qqZyf5cLpbXJzSWruwqp7ZL39tkjPSjXh6SbrbY/zubNv2uz4pyTuq6veSfCPJ45bxsBbbiu4au8Y595Pj3E+Ocz85zv1kOf+T49xPjnM/Oavu3C/7fRQBAABY2Za76ykAAAArnKAIAADAgKC4QlTVI6rqy1V1SVUdN+l61rKquktVnV1VX6yqC6vqOf38E6rqm1V1fv/zyEnXulZV1aVVdUF/njf3836mqj5SVRf3v/eYdJ1rTVX93Mj7+/yq+k5VPdd7f2lU1SlVdVVVfWFk3ozv86o6vv8/4MtV9fDJVL02zHDu/7qqvlRVn6+q91TV7v38A6rqeyPv/9dOrPA1YIZzP+NnjPf94pnh3L995LxfWlXn9/O97xfRLN8tV/VnvmsUV4Cq2inJV5Icke42IOcmeWJr7aKJFrZG9bdQ2be19tmqun2S85L8RpLHJ7mhtfY3k6xvPaiqS5Mc2lr71si8lyb5dmvtpP6PJXu01p4/qRrXuv5z55tJHpBu0DDv/UVWVb+c5IYkb26tHdzPm/Z9XlX3SvLWJIcluXOSjya5R2vtxxMqf1Wb4dw/LMnH+sHx/ipJ+nN/QJJ/nlqPHTPDuT8h03zGeN8vrunO/djyl6W7m8CLve8X1yzfLZ+WVfyZr0VxZTgsySWtta+11n6Q5G1JjppwTWtWa+2K1tpn+8fXJ/likv0mWxXp3vOn9o9PTfcBy9L5H0m+2lr7z0kXsla11j6e5Ntjs2d6nx+V5G2tte+31r6ebuTvw5ajzrVounPfWjuztfajfvLT6e7HzCKb4X0/E+/7RTTbua+qSvcH
8bcua1HrxCzfLVf1Z76guDLsl+SykektEVyWRf8Xtfsk+fd+1rP7bkmn6Pq4pFqSM6vqvKo6pp+3T3/P1PS/955YdevDEzL8wuC9vzxmep/7f2B5PT3JB0emD6yq/6iqf6mqX5pUUWvcdJ8x3vfL55eSXNlau3hknvf9Ehj7brmqP/MFxZWhppmnT/ASq6rbJXlXkue21r6T5OQkd09ySJIrkrxsctWteQ9urd03yZFJntV3l2GZVNVtkjw6yT/1s7z3J8//A8ukqv4syY+SvKWfdUWSu7bW7pPkD5OcVlV3mFR9a9RMnzHe98vniRn+cdD7fglM891yxlWnmbfi3vuC4sqwJcldRqb3T3L5hGpZF6pq53T/kN/SWnt3krTWrmyt/bi19pMkb8gK7AKwVrTWLu9/X5XkPenO9ZV9H/+pvv5XTa7CNe/IJJ9trV2ZeO8vs5ne5/4fWAZVdXSSX0/y5NYP0tB3/bqmf3xekq8mucfkqlx7ZvmM8b5fBlV16yS/meTtU/O87xffdN8ts8o/8wXFleHcJAdV1YH9X/qfkOT0Cde0ZvX99P8uyRdbay8fmb/vyGqPSfKF8W3ZcVX1U/2F3qmqn0rysHTn+vQkR/erHZ3kfZOpcF0Y/GXZe39ZzfQ+Pz3JE6rqtlV1YJKDknxmAvWtWVX1iCTPT/Lo1tqNI/Pv2A/ulKr62XTn/muTqXJtmuUzxvt+efxaki+11rZMzfC+X1wzfbfMKv/Mv/WkCyDpR2B7dpIPJ9kpySmttQsnXNZa9uAkT01ywdQw0Un+NMkTq+qQdE3/lyZ5xiSKWwf2SfKe7jM1t05yWmvtQ1V1bpJ3VNXvJflGksdNsMY1q6p2SzfC8uj7+6Xe+4uvqt6a5KFJ9qqqLUlemOSkTPM+b61dWFXvSHJRum6Rz1ppo9+tJjOc++OT3DbJR/rPn0+31p6Z5JeTvLiqfpTkx0me2Vqb72AsjJnh3D90us8Y7/vFNd25b639Xba9Jj3xvl9sM323XNWf+W6PAQAAwICupwAAAAwIigAAAAwIigAAAAwIigAAAAwIigAAAAwIigAsWFWdUFWtqj48zbJ3VtU5y1jLQ/taDl6u51yIqrpnVX2iqr7b13nAdu7nnKp65yKXt+iqau/+/XHApGsBYPsJigDsiIdV1f0nXcQK99dJdk/y6CQPSnLFRKtZenunu3feAROuA4AdICgCsL2+neTzSf5s0oUsparaZQd38fNJPtJaO6u19unW2vcXo67FVFW7TrqGmazk2gDWMkERgO3VkvxlkkdX1YaZVuq7IX5rmvmtqp49Mn1pVf1NVR1XVVdU1XVV9bLqPLKqLqyq66vqvVW1xzRPdeeq+ue+i+c3quqZ0zznQ6rqX6rqxqq6pqreUFW3H1n+tL6uw/qunt9L8rxZju2Qqjqr39+1VfWWqtqnX3ZAVbUkd0/yB/1+z5llXztV1fFV9ZWq+n5VbamqN02z3pOq6pKq+k5VfbCq9h9bflJVXVBVN/T7eEtV3WlsnUv7c/vnVbUlyXf6+Q+qqtOr6vL+PJ5fVU+epoa7VdVbq+pb/bF/vq/rgCQX9Kud3R9zG9nuZ6rqdVV1ZVXdVFWfrKoHjO27VdUfVtUrq+rqqf31r90n+uP+Tl/b42Y6nwDsmFtPugAAVrV/SvKidK2KT1iE/T0hyWeS/G6S+yX5v+n+qPnLSf48ya5J/jbJiUnGg+DfJfmHJP8vyW8mObmqtrTW/jlJqurBSc5K8t4kv5VkzyQnJdmjnx711iQn98e2dbpCq+qOSc5J8sUkT0pyu35/H6mqQ9N1MX1Qkvck+Vhf13dmOfbXJfmdJC9N8i9Jfmaauh6Q5M5J/qg/F69K8vokjxxZZ+90Af7yJHfs1/1YVW1orf14ZL0nJbkwycbc8n3gbkn+Lclrk9yU5MFJ/r6qftJae2t/3Hsn+VSSG5P8cZLLkhyc5C79MT85yVuSPCvJZ0fO122TfDRdN9znJbkq
ybFJPlpVB7XW/muktucl+XiSpya5VVXdIck/J3lfkhcnqSQb+n0BsAQERQC2W2vtJ1V1UpK/q6oXtNa+soO7vCnJ4/pA86GqOirJ/05yUGvt60lSVfdOcnS2DYofbK39af/4w1X1s0n+T7qAkXQh7pOttd+e2qCqvpnkrKo6uLX2hZF9vbq19qo5av2j/vfDW2tTLXJfSfLvSR7bB6tPV9X3k1zRWvv0TDuqqp9P8ntJntNae/XIorePrXqHJI9qrV3bb3enJK+oql1ba99Lktba00f2u1O6ULclXej7+Nj+fr21dtPURGvtbSPbVr/+/kl+P114TpI/SPLTSe7XWpu63vKske0+3z+8aOyYn5IuUP731trF/bofTfLldOdytOX2v8Zep0P753x2a+36fvaZAWDJ6HoKwI76xyTfSHL8IuzrnLFWr0uSXDoVEkfm3bGqbjO27XvGpt+d5H59l87d0rXuvaOqbj31k+Rfk/wwXevlqA/Mo9bDkpw5FRKTpLX2mSSXJnnIPLYfdXj/+01zrHfuVEjsXdT/3m9qRlUd2XfpvC7Jj9KFxCS5x9i+zhoNif22e1TVq6vqP9Odlx8mOWZs219N8qGRkDhfv5bkvCRfHzn/Sdd6eujYuuPn/6tJbkhyWlUdVVW7L/C5AVggQRGAHdJa+1G67pJPqaq77eDuto5N/2CGeZVkPCheNc30rZPsla576U5JNuWWAPTDJN9PsnO6bpOjrpxHrfvOsN6V6bqNLsSeSb47GjpnsHVs+gf9712SpLoRaE9PFw6fmi4cP3B0nbE6x70pyW+nG6n1YUnun+SUsW33zPaN3LpXX8sPx35+N3Oc/z4cPyzda/WOJFdX1Qf6VmMAloCupwAshlPSdfN8/jTLbspYqKvpB6PZUXtPM/2jJN9KF3RakhOSnDHNtpePTbdp1hl3xTTPmST7pGs5W4hrkvxUVd1hHmFxNo9JcnWS326ttaQbeGaGdQfHWN3oro9K173ztSPzx/+ofE26kLxQ306yOd11iePGR4Ld5vy31j6V5BHVjYL6a0lenuS03BKEAVhEWhQB2GH9LR/+JsnTs22I2JLk9lW138i8hy1BGY+ZZvq81tqPW2vfTfLpJD/XWts8zc94UJyPf0/y8LFRU++f7v6B/7rAfX2s//0721HHqF2T/HAqJPa2GbV0BrdN1+p6c2jrj+3RY+udle6495lhP4NWzrHt/luSb0xz/i/IPLXWvtdae3+6P07ca77bAbAwWhQBWCyvS/KnSX4x3XVnUz6U5HtJTqmqlyU5MNsORLMYjqyql/TP/ZtJjkhy1MjyP0k3cM1PkrwzyfVJ7pquFe3PtmMgnpenax37cFX9VW4Z9fSCJO9ayI5aa1+uqtcneVk/qujH043o+VuttYWMJvuRJM+tqlcmeX+61+Ip86zhuqo6N8kLquo7SX6S5Lgk16UbRGfKK9IF2k/05/uyJPdM8lOttZemu171e0mO7q+T/GFrbXOSN6d73c+pqr9J8rV03VgPSzd4zStmqq2qHpXujxDv7fe/X5Jn5JaADcAi06IIwKJord2YLkSMz/9WksemGz3zvemCy5OWoIT/leS+/XP8epJntdZOH6njX9PdZuOO6W6j8f504fGyzO+axIHW2tXpBqG5Kd2IoK9J8okkR7TWfjDbtjPYmO52HE9J1z32lekC10JqOiNd99/HprtW8VfSnYv5elKSr6cLda9KF3jfPPYcV6cbQfU/+hr/Od2AN9/ol9+UbpTU+6UL7eeOzD88XZh9UbpRS1+V5KB0t0SZzSW55b6dZ6a7JvZD6cIjAEughr1TAAAAWO+0KAIAADAgKAIAADAgKAIAADAgKAIAADAgKAIAADAgKAIAADAgKAIAADAgKAIAADDw/wNyV70NnUsoFgAAAABJRU5ErkJggg==\n",
      "text/plain": [
       "<Figure size 1080x720 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
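    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "pal = sns.color_palette()\n",
    "\n",
    "# Overlay density-normalised histograms of question length (in characters) for train and test.\n",
    "# The x-axis is cut at 200 characters.\n",
    "plt.figure(figsize=(15, 10))\n",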
    "plt.hist(dist_train, bins=200, range=[0, 200], color=pal[2], density=True, label='train')\n",
    "plt.hist(dist_test, bins=200, range=[0, 200], color=pal[1], density=True, alpha=0.5, label='test')\n",
    "plt.title('Normalised histogram of character count in questions', fontsize=15)\n",
    "plt.legend()\n",
    "plt.xlabel('Number of characters', fontsize=15)\n",
    "plt.ylabel('Probability', fontsize=15)\n",
    "\n",
    "# Print summary statistics of question length for both sets.\n",
    "print('mean-train {:.2f} std-train {:.2f} mean-test {:.2f} std-test {:.2f} max-train {:.2f} max-test {:.2f}'.format(dist_train.mean(), \n",
    "                          dist_train.std(), dist_test.mean(), dist_test.std(), dist_train.max(), dist_test.max()))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
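    "We can see that most questions are between 15 and 150 characters long. The test distribution looks slightly different from the training distribution, but not by much (I am not sure whether this is just the larger dataset reducing noise, but the distribution appears much smoother in the test set).\n",
    "One thing that caught my eye is that the training set has a steep cutoff at 150 characters for most questions, while the test set tapers off slowly after 150 characters. Could this be some kind of Quora question length limit?\n",
    "Also worth noting: I truncated this histogram at 200 characters, and for both sets the maximum of the distribution is just under 1,200 characters, although samples longer than 200 characters are very rare. Let's do the same for word counts. I will use a naive method for splitting words (splitting on spaces rather than using a serious tokenizer), although this should still give us a good idea of the distribution."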
   ]
  },
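  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the difference between naive whitespace splitting and a stricter tokenizer concrete, here is a minimal sketch. The sample question and the regular expression are assumptions chosen for illustration; they are not taken from the dataset or from the feature code used elsewhere in this notebook.\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "# Hypothetical sample question, used only for illustration.\n",
    "question = 'What is the step-by-step guide to invest in share market in India?'\n",
    "\n",
    "# Naive approach: split on whitespace only, so punctuation stays attached to words.\n",
    "naive_tokens = question.split()\n",
    "\n",
    "# Slightly stricter alternative (an assumption, not a full tokenizer):\n",
    "# lowercase the text and keep only runs of ASCII letters and digits.\n",
    "regex_tokens = re.findall('[A-Za-z0-9]+', question.lower())\n",
    "\n",
    "print(len(naive_tokens), naive_tokens)\n",
    "print(len(regex_tokens), regex_tokens)\n",
    "```\n",
    "\n",
    "The two counts differ because the whitespace split treats 'step-by-step' as a single word and keeps the question mark attached to 'India?'. For a rough look at the length distribution this difference is probably small enough to ignore."
   ]
  },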
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T11:53:24.701921Z",
     "iopub.status.busy": "2021-07-10T11:53:24.701628Z",
     "iopub.status.idle": "2021-07-10T11:53:32.712235Z",
     "shell.execute_reply": "2021-07-10T11:53:32.711128Z",
     "shell.execute_reply.started": "2021-07-10T11:53:24.701892Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "mean-train 11.06 std-train 5.89 mean-test 11.02 std-test 5.84 max-train 237.00 max-test 238.00\n"
     ]
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAA4MAAAJjCAYAAAC2pMTMAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8QVMy6AAAACXBIWXMAAAsTAAALEwEAmpwYAAA8sUlEQVR4nO3de7htZV0v8O/PDQYIisEWEVCwQyaBoSKpqEkFgplY3o3SOidSMLVSAzt5qUy8ZOY5bhANxaNo5JUUAyRQTBE2hCKggkSxBWWLbsQLIPKeP8ZYMlmsvdd9rrXX+HyeZz1rznfcfnPOMfezvvt9xzuqtRYAAACG5W5LXQAAAADjJwwCAAAMkDAIAAAwQMIgAADAAAmDAAAAAyQMAgAADJAwCCypqnp1VbWqOn2KZR+sqnOWoKxZq6rH969j75G2VlUvHMOx9+6P9fjZ1LeR9d5dVWtnceyDq+olMy52IKrqlVX1jaq6varevdT1bExVvamqrl7qOmajqu7e/7ux7wzWfV5/3m87htLGamPfvdl+h4Fh22KpCwDoHVxVj2itXbDUhSygRyX5z6UuYpb+OsnWs1j/4CRPS/KWRalmM1RV+yV5TZJXJDknyfVLWtDKc/ckr0pydZKLp1n3E+m+hz9c3JKWxMa+e7P9DgMDJgwCy8F3kqxL8hdJnrLQO6+qrVtrP1ro/U6ntXbeuI85X621ry91DdOpqi2T3N5a+8lS17IRv9D/fltr7XtLWkmW7vxfDlpr65OsX+o6xmlz+A4Dy4dhosBy0JL8bZInV9U+m1qxqvatqrOq6odV9d2qel9V7TSyfPd+WNjvVNV7qmpDkn8ZaX9WVb2rqr5XVeuq6vB+u5dX1bVVtb6qXl9VdxvZ5y9U1Qeq6pr+uJdW1UtG19lIrXcaJlpVj6mqc/tjf6+qLq6qp0/a5n/1+7+lqv6rql4+xX6P7Gv5QVX9S5KdN/nu3tmOVfXPVfX9qrqqqo6ctO87DTGrqu2r6p39e3NzVf13Vb2jX/bqJH+W5AH9a22jQyKr6hlVdUn/Wq6pqtdW1RaTjvf4qvpSv+8Lqmr/qvp2v++Jdc7phwwfUVVfT3JzkvvN5HMZGR77a1X1sf49u6IfYreqqt7YH+8bVfWn0715/Tav7t+HW/pjPmf0/Uvy//qnN9ZGhu9W1QP7ZY8eaXt/3/aQkbZ/qar3jTzfo6o+2p8/N/XL/8ekfbeq+tOqektVrU9yychneXL/HlxXVX8x3esd2efjqurs/ry5sf9MHjqyfLrv5ZTDlCc+29H3r6rWVtVB/Xnxg6r6bFX94shmN/W/3zVy3u2+kbrvNEy07vh34BlV9fb+tayrqtfU9N/n6j/76/v3/j1V9ZzR48/0dfZtj6mqT/fv2Q1V9Y6q2m5k+Zy+ezXFMNEZfD4zel+qateqOqV/D35UVV+vqr/e1PsGLG/CILBc/HOSr6XrHZxSVa1ON+xumyTPSfLHSX4lyZlVdfdJq78p3R+NT08XNCe8Psl1SZ6a5NwkJ1XV3yXZP8kfpBty9fIkzxjZZpckX01yZJInJnlHumGAfz7TF1dV90zy8SRX9cd+WrrQsP3IOi9LclySjyZ5Uv/4r+vOgfKwJG/r9/Xb6f7QP3GmdfS1fzHJb6V7L99WVftvYv03J3lMkj9J8oR0Qx9bv+ydSU5O8s10Q/EelW6IWqrq4CT/lOSiJIcl+T9JXprk/468ll2SnJZuGOXTkrw9yfsy9RC3A5K8IN17/ptJbszsPpe3J/ls/7r/K8kH+1q2S3cufTDJ31XVIzfxXiTJX6U7R09I8uQk/57kfVX17H75Xyf5m/7xr/bvyUWTd9JauyrJN5I8dqT5semC7mOTLnz0r/vc/vnPJDkryYOT/GGS5yXZI8mnq+pnJx3iZen+k+B3k7yob3tXkkOTvCTJEemGGT5rmteb6sLsWUl+nOS5SZ7Z17RLv3w238uZuH+SNyZ5
bZJnJ7lPklP69yPp3teke58nzrvrZnmMNyT5frrz7r1JXtk/3pQX9eud0K/7o34/s1ZVB6R7T7/Z7+sl6c7hd42sNqfv3hTHms3nM9378p4ku6U7fw5N9xn9zMxfObDstNb8+PHjZ8l+krw6ybf7x89L8pMkP98//2CSc0bWPTbJhiT3HGnbP90fSM/un+/eP//IpONMtL9rpO2e6f7AvSLJqpH285P800bqrXRD7F+R5KqR9sf3+997pK0leWH/eL/++XYb2e890/0R9qpJ7X+V7g++VSO1fXLSOu/o9/34TbzPE/X91UjblumG0B070vbuJGtHnn85yR9vYr9vSnL1FO3nJTl7UtvL+8931/75G5N8O8nWI+s8o6/z1SNt56T7w/u+m6hjus/lVSNte/Vt/zbSdrf+fX79Jo7xs0l+MMVndFqSr448f16//22nOfffn+Tj/eMH9u/NmiQf6Nse0u/nF/vnz09yW5IHjuxj1yS3Jjlm0nn3H5OO9Yt9+zNH2rZNN0T7Lp/fpG0/n2RtktrI8pl8Lyc+h70nbXtOkg9OOv9uS7LnSNtT+m1/YaTuluR5m6p7qs8id/w78J5J61088b5vZD+rklyb5LhJ7Wf2+9t9lq/z3Nz1+/Gro9tm7t+9d+fO3+HZ/Lu5yfcl3b9Rvznd++7Hj5/N50fPILCcvDfJfyc5ZiPL909yRhu5Dqu1dn66iSQeM2ndT2xkH2eNbPu9dGHo0+3O159dmb7XI0mqaqt+uNSVSW5JFyBfm2SPmjTscRO+nu4PqZOr6rCq2n7S8kcluUeSf66qLSZ+kvxbkp2S7FpVq5I8NMnHJm374RnWkCRnTDxorU0E4V03sf7FSV5W3dDUn5/JAfo6H5aut3fUP6ULXY/qnz8iyZntzteznbqR3V7YWvvmpOPM5nM5a+Txlf3vf5toaK3dnq7Xdpds3N7pelemel0/X1X32cS2Uzk3yQH9MLzHJflSkn/JHb2Fj0sX1i7rn++f5KLW9SpO1L0uXe/kdOf/I/rfP31/W2vfTxdmNqqq7pHkl5Oc1FprG1ltNt/Lmbi6tXbFyPOJ17+p83S2zpj0/LJp9r9bup7W+Xz3kiRVtU2678Apk77rn013Dj+8X/XizPK7txGz+Xyme18uTvK6fvjt/edRE7BMCIPAstFauy3dMKXDq+oBU6yyc5JvTdH+rXS9NpPbprJh0vNbN9K21cjz16cb4nhCuqFcj8gdQwG3ygy01r6bbljelklOSbK+qj5RVQ/sV9mx/31puj8IJ37O7tt3S7I6Xe/X5NkpZzNb5YZJzye/1slemG7Y6iuTfLW66+2mG1q4Y7rXOfkzmHg+8VndN5Mm92it3ZwuNE821ec5m89lw8gxbp3c1pvuvZi4NnNjr+vem9h2Kp9JN0x473QB8Nx0we6+/Xnx2CSfHQlh8zn/75vkpnbXiWSmO3funa7XdVPDMGdT10xsmPR84vOa0XdtHsfY1P7v2/+ez3dvwr3T9TSuyZ2/67ek+97s1q83l+/eVGbz+WyY9Hzy+/LMdL3Ef5/kv6q77vnX5lATsEwIg8Byc2K6P7Cmuu7runTXD022U7oelFEb68WYi6cn+T+ttTe01j7VWlubbijbrLTWPt9aOyRdAPjtJD+f7rqf5I76n5Qu1Ez++WK64HRb7voezLZHajY1b2itvai1dt8kv5TkC+mukdtrE5t9O90ft5PrmpiwYuK1fjNdwP2pqtoq3TDAu5QyRduCfC6zMBGIpntdM3Vpv81j0/UCfqbvvflS3zYREEePP9fz/5tJtquqyddjTnfufDfJ7dn0JEUzqevm/vfka9TmEhaXwkSv9HTfvZm8zg3phy5n6u/6icmcv3tTmc15s0mttW+01p6XZId0vZvfTHJqVe0wy5qAZUIYBJaV1tot6a6F+YPc9Q/QLyR5wqQZ9x6R7nqXzy5iWVun+1/7iWOuygwm3tiY1tqPWmv/ku6Pvok/7D6f7rq4
+7XW1k7xc1M/lPXidBOyjPrtudYyy7q/lG5ikrvljtsn3KVHpa/zwnRhbdQz0gWLz/fPL0hy0KSA8uRZlLSgn8sMfDnd/eqmel1fa91tDGas7/H79377/5GupzD974nzfzQMfiHJw6tqj4mGfhKeR2f683/i/p0/fX+rm2HzoGlq/EF/3N8bmcBlspl8L9f1vx88ss5uSR40Td1TWYyewulcky74TPfdm/Z19u/peUketJHv+rWTDz7T795GLPi/m62121t365zXpBs6PdVIDmAz4D6DwHL09nQTgTw6yadH2t+cbkbJ06vq9el6kI5NN6PmhxaxnjOTHNVfm/adJEdlljPoVdVvpPsD/6PprovcJckfpb9urbW2oZ8u/h/6IbKfSfeH388nObC19lv9rv42yYer6rgkH0k3K+Ah83lx09T92f44X07Xm/GH6SZROb9f5StJdqqq5/XrfLu1dnW6Xo/Tq+pdST6QZJ90sx2+o7/OLelmbj0q3a0//j7dULyj0wWu22dQ3rw/l9lorX2nqt6S5H9X1W3phsv9drohqs/e1Lab8Jl0E+l8tbU2MeTw3HQzV/4wd56J9N3pesw/WVWvTDfhzKvT9cS+fZraL62qU5Mc189se126cDGTm7EfneRT/XFPSPf5PyrdJCUfzwy+l621dVV1QbrZcX+Y7tx+RWbfm5rW2q1V9Z9JnlFVX07XG/elkeG/C6619pOqekOSN1XVt9N9Rk/NSOjr15vp63x5krOq6vZ0E2XdlG4W1d9I8hetta/N47s32YL8u1lV90pyeroZRb+W7rv2Z+lC8uUz3Q+wvOgZBJad1toP012TMrl9fZID0/3x9/50t1g4N8lBi/mHYLqp2M/tj3diuj+8XjfLfVyZO+6neEa6ayP/NV1ATJK01t6QO6Zs/1i61/g7Gekdaq19pK/nN9MFy4cm+Z+zf0kz9vl0MzJ+MN21jjsmOXQk0J2SLqS8IV3v06v7Os9I10u3X7pJUV6S5O/SXQc18Vq+ke6P3/ukm4jjj9O9H6uSzORm7QvxuczWK/tjvCDd7T0el+Tw1toH5ri/ic/2M1O0faGf5CfJT3vNfz1dCPjHJCelu03G41trMwlVz0t37r2l3/6sdEF9k1prn0nXg7hNukme/indf0Ks65fP9Hv5nHT/EfLedN+Dv0p3a5C5eH66c/FT6c67+81xP7PxlnR1Pz9diNo2XaibbNrX2Vr7bLpzZ3W6W8z8S7+va3LH9X1z+u5NtoD/bt6cLkC+ON1ERCel+8+Eg6e4FhXYTNTGJwcDgPGqqsek+0P1V1trZ0+3PiylqnpSuiC3x0Z65QCWNcNEAVgy/bC1/0g31OxBSf4y3QQqn97UdgDA/AmDACyln0l3zdxO6a6bOiPJn/b3/QMAFpFhogAAAANkAhkAAIABWvHDRHfccce2++67L3UZAAAAS+LCCy/8dmtt9eT2FR8Gd99996xdu3apywAAAFgSVfVfU7UbJgoAADBAwiAAAMAACYMAAAADtOKvGQQAAIbrxz/+cdatW5ebb755qUtZdFtttVV23XXXbLnlljNaXxgEAABWrHXr1mW77bbL7rvvnqpa6nIWTWstN9xwQ9atW5c99thjRtsYJgoAAKxYN998c3bYYYcVHQSTpKqyww47zKoHVBgEAABWtJUeBCfM9nUKgwAAAAPkmkEAAGAw9jlpnwXd3yXPvWSTyzds2JCTTz45Rx555Kz2+8QnPjEnn3xytt9++3lUt2l6BgEAABbJhg0bsmbNmru0/+QnP9nkdqeddtqiBsFEzyAAAMCiOfroo/P1r389++67b7bccstsu+222XnnnXPxxRfnsssuy1Oe8pRcc801ufnmm/PiF784RxxxRJJk9913z9q1a/P9738/hx56aB7zmMfkc5/7XHbZZZd87GMfy9Zbbz3v2vQMAgAALJJjjz02P/dzP5eLL744b3zjG3P++efnta99bS677LIkyYknnpgL
L7wwa9euzVvf+tbccMMNd9nHFVdckaOOOiqXXnpptt9++3zoQx9akNr0DAIAAIzJ/vvvf6f7AL71rW/NRz7ykSTJNddckyuuuCI77LDDnbbZY489su+++yZJHv7wh+fqq69ekFqEQQAAgDG5xz3u8dPH55xzTj71qU/l85//fLbZZps8/vGPn/I+gT/zMz/z08erVq3Kj370owWpxTBRAACARbLddtvlpptumnLZjTfemHvf+97ZZptt8pWvfCXnnXfeWGvTMwgAAAzGdLeCWGg77LBDDjjggOy9997Zeuuts9NOO/102SGHHJLjjz8+D3nIQ/KgBz0oj3zkI8daW7XWxnrAcdtvv/3a2rVrl7oMAABgCVx++eV58IMfvNRljM1Ur7eqLmyt7Td53bEPE62qQ6rqq1V1ZVUdPcXyX6iqz1fVLVX10pH23arq7Kq6vKouraoXj7dyAACAlWOsw0SralWStyU5KMm6JBdU1amttctGVvtOkhclecqkzW9L8mettYuqarskF1bVmZO2BQAAYAbG3TO4f5IrW2tXtdZuTfKBJIeNrtBau761dkGSH09qv661dlH/+KYklyfZZTxlAwAArCzjDoO7JLlm5Pm6zCHQVdXuSR6a5AsbWX5EVa2tqrXr16+fS50AAAAr2rjDYE3RNqsZbKpq2yQfSvKS1tr3plqntXZCa22/1tp+q1evnkOZAAAAK9u4w+C6JLuNPN81ybUz3biqtkwXBN/XWvvwAtcGAAAwGOO+z+AFSfasqj2SfCPJs5I8ZyYbVlUl+cckl7fW3rx4JQIAACvW2a9b2P0deMwmF2/YsCEnn3xyjjzyyFnv+i1veUuOOOKIbLPNNnOtbpPGGgZba7dV1QuTnJ5kVZITW2uXVtXz++XHV9V9k6xNcs8kt1fVS5LsleQhSX43ySVVdXG/y1e01k4b52uAmdrnpH3mtf24b4gKAMDC27BhQ9asWTPnMHj44YevjDCYJH14O21S2/Ejj7+ZbvjoZJ/N1NccAgAALEtHH310vv71r2fffffNQQcdlPvc5z455ZRTcsstt+S3fuu38prXvCY/+MEP8oxnPCPr1q3LT37yk/zlX/5lvvWtb+Xaa6/NgQcemB133DFnn332gtc29jAIAAAwFMcee2y+/OUv5+KLL84ZZ5yRD37wgzn//PPTWsuTn/zkfOYzn8n69etzv/vdL5/4xCeSJDfeeGPuda975c1vfnPOPvvs7LjjjotS27gnkAEAABikM844I2eccUYe+tCH5mEPe1i+8pWv5Iorrsg+++yTT33qU/nzP//znHvuubnXve41lnr0DAIAAIxBay3HHHNM/uiP/uguyy688MKcdtppOeaYY3LwwQfnla985aLXo2cQAABgkWy33Xa56aabkiRPeMITcuKJJ+b73/9+kuQb3/hGrr/++lx77bXZZpttcvjhh+elL31pLrroortsuxj0DAIAAMMxza0gFtoOO+yQAw44IHvvvXcOPfTQPOc5z8mjHvWoJMm2226b9773vbnyyivzspe9LHe7292y5ZZb5rjjjkuSHHHEETn00EOz8847L8oEMtVaW/CdLif77bdfW7t27VKXwQC5tQQAwNK7/PLL8+AHP3ipyxibqV5vVV3YWttv8rqGiQIAAAyQMAgAADBAwiAAALCirfRL4ybM9nUKgwAAwIq11VZb5YYbbljxgbC1lhtuuCFbbbXVjLcxmygAALBi7brrrlm3bl3Wr1+/1KUsuq222iq77rrrjNcXBgEAgBVryy23zB577LHUZSxLhokCAAAMkDAIAAAwQMIgAADAAAmDAAAAAyQMAgAADJAwCAAAMEDCIAAAwAAJgwAAAAMkDAIAAAyQMAgAADBAwiAAAMAACYMAAAADJAwCAAAMkDAIAAAwQMIgAADAAAmDAAAAAyQMAgAADJAwCAAAMEDCIAAAwAAJgwAAAAO0xVIXAExtn5P2mfO2lzz3kgWsBACAlUjPIAAA
wAAJgwAAAAMkDAIAAAyQMAgAADBAwiAAAMAACYMAAAADJAwCAAAMkDAIAAAwQMIgAADAAAmDAAAAAyQMAgAADJAwCAAAMEDCIAAAwABtsdQFwEr1gu9umNf2x917+wWpAwAApqJnEAAAYICEQQAAgAESBgEAAAZIGAQAABggE8jAJuxz0j5z3vYFC1gHAAAsND2DAAAAA6RnEJap+d6aAgAANkXPIAAAwAAJgwAAAAMkDAIAAAyQMAgAADBAwiAAAMAACYMAAAADJAwCAAAMkDAIAAAwQMIgAADAAAmDAAAAAyQMAgAADJAwCAAAMEDCIAAAwAAJgwAAAAMkDAIAAAyQMAgAADBAwiAAAMAACYMAAAADJAwCAAAMkDAIAAAwQMIgAADAAAmDAAAAAyQMAgAADJAwCAAAMEBjD4NVdUhVfbWqrqyqo6dY/gtV9fmquqWqXjqbbQEAAJiZsYbBqlqV5G1JDk2yV5JnV9Vek1b7TpIXJXnTHLYFAABgBsbdM7h/kitba1e11m5N8oEkh42u0Fq7vrV2QZIfz3ZbAAAAZmbcYXCXJNeMPF/Xty3otlV1RFWtraq169evn1OhAAAAK9m4w2BN0dYWetvW2gmttf1aa/utXr16xsUBAAAMxbjD4Loku4083zXJtWPYFgAAgBHjDoMXJNmzqvaoqrsneVaSU8ewLQAAACO2GOfBWmu3VdULk5yeZFWSE1trl1bV8/vlx1fVfZOsTXLPJLdX1UuS7NVa+95U246zfgAAgJVirGEwSVprpyU5bVLb8SOPv5luCOiMtgUAAGD2xn7TeQAAAJaeMAgAADBAwiAAAMAACYMAAAADNPYJZIDFt89J+8x520uee8kCVgIAwHKlZxAAAGCAhEEAAIABEgYBAAAGSBgEAAAYIGEQAABggMwmCivQC767YalLAABgmdMzCAAAMEB6BmET9LABALBS6RkEAAAYIGEQAABggIRBAACAARIGAQAABkgYBAAAGCBhEAAAYICEQQAAgAESBgEAAAZIGAQAABggYRAAAGCAhEEAAIABEgYBAAAGSBgEAAAYIGEQAABggIRBAACAARIGAQAABkgYBAAAGCBhEAAAYICEQQAAgAESBgEAAAZIGAQAABggYRAAAGCAhEEAAIABEgYBAAAGSBgEAAAYIGEQAABggIRBAACAARIGAQAABkgYBAAAGCBhEAAAYICEQQAAgAESBgEAAAZIGAQAABggYRAAAGCAhEEAAIABEgYBAAAGSBgEAAAYIGEQAABggIRBAACAARIGAQAABkgYBAAAGCBhEAAAYICEQQAAgAESBgEAAAZIGAQAABggYRAAAGCAhEEAAIABEgYBAAAGSBgEAAAYIGEQAABggIRBAACAARIGAQAABkgYBAAAGCBhEAAAYICEQQAAgAESBgEAAAZIGAQAABggYRAAAGCAhEEAAIABEgYBAAAGSBgEAAAYIGEQAABggIRBAACAARIGAQAABkgYBAAAGKCxh8GqOqSqvlpVV1bV0VMsr6p6a7/8S1X1sJFlf1JVl1bVl6vq/VW11XirBwAAWBnGGgaralWStyU5NMleSZ5dVXtNWu3QJHv2P0ckOa7fdpckL0qyX2tt7ySrkjxrTKUDAACsKOPuGdw/yZWttataa7cm+UCSwyatc1iS97TOeUm2r6qd+2VbJNm6qrZIsk2Sa8dVOAAAwEoy7jC4S5JrRp6v69umXae19o0kb0ry30muS3Jja+2MqQ5SVUdU1dqqWrt+/foFKx4AAGClGHcYrCna2kzWqap7p+s13CPJ/ZLco6oOn+ogrbUTWmv7tdb2W7169bwKBgAAWInGHQbXJdlt5PmuuetQz42t8+tJ/rO1tr619uMkH07y6EWsFQAAYMUadxi8IMmeVbVHVd093QQwp05a59Qkv9fPKvrIdMNBr0s3PPSRVbVNVVWSX0ty+TiLBwAAWCm2GOfBWmu3VdULk5yebjbQE1trl1bV8/vlxyc5
LckTk1yZ5IdJfr9f9oWq+mCSi5LcluQ/kpwwzvoBAABWirGGwSRprZ2WLvCNth0/8rglOWoj274qyasWtUAAAIABGPtN5wEAAFh6wiAAAMAACYMAAAADJAwCAAAM0NgnkIFx2+ekfea87QsWsA4AAFhO9AwCAAAMkJ5B4M7Oft3ctz3wmIWrAwCARaVnEAAAYICEQQAAgAESBgEAAAZIGAQAABggYRAAAGCAhEEAAIABEgYBAAAGSBgEAAAYIGEQAABggIRBAACAARIGAQAABkgYBAAAGCBhEAAAYICEQQAAgAESBgEAAAZIGAQAABigLZa6AFhsL/juhqUuAQAAlh09gwAAAAMkDAIAAAyQMAgAADBAwiAAAMAAmUAGuJM1X1wz522PPPCYBawEAIDFpGcQAABggIRBAACAARIGAQAABkgYBAAAGCBhEAAAYIBmFQar6klVJUACAABs5mYb7D6W5BtV9fqqevBiFAQAAMDim20Y/LkkJyR5RpIvV9Xnq+oPq+qeC18aAAAAi2VWYbC1dnVr7VWttT2SHJTkyiR/n+S6qvp/VXXgYhQJAADAwprz9X+ttX9rrf1ukp9PcmGS30nyqar6z6r6k6raYqGKBAAAYGHNOQxW1a9U1buTfDXJ3kneluTgJP+c5DVJ3rMQBQIAALDwZtV7V1UPSPLc/mf3JOckOSLJh1trt/SrnVVVn0/y3oUrEwAAgIU026GcVyW5Nsm7k5zYWvvPjax3aZLz51EXAAAAi2i2YfA3k/xra+32Ta3UWvtaEpPJAAAALFOzvWbwaUkeMNWCqnpAVZ04/5IAAABYbLMNg89Nsnojy3bslwMAALDMzTYMVpK2kWV7J1k/v3IAAAAYh2mvGayqFyd5cf+0JfloVd0yabWtkuyUbmIZAAAAlrmZTCBzWZIPpesV/NMkZye5btI6tyb5SpJTFrQ6AAAAFsW0YbC1dmaSM5Okqm5K8s7W2jcWuzAAAAAWz6xuLdFae81iFQIAAMD4zOSawVOSHNNa+3r/eFNaa+2ZC1MaAAAAi2UmPYOrk2zZP75PNj6bKAAAAJuJmVwzeODI48cvajUAAACMxWzvMwgAAMAKMJNrBo+czQ5ba2vmXg4AAADjMJNrBv/vLPbXkgiDAAAAy9xMrhk0lBQAAGCFEfQAAAAGaCbXDO6V5OuttVv6x5vUWrtsQSoDAABg0czkmsEvJ3lkkvP7xxu7z2D1y1YtTGkAAAAslpmEwQOTXDbyGAAAgM3cTCaQ+fRUjwEAANh8zaRn8C6q6kFJHpFk5yTXJVnbWvvKQhYGAADA4plVGKyqeyZ5R5KnppuJ9PtJtk1ye1V9OMn/aq19b8GrBAAAYEHN9tYSa5IcnOT3kmzTWrtnkm2SPDfJQXHDeQAAgM3CbIeJHpbkT1prJ080tNZuTvK+qtomyZsXsjgAAAAWx2x7Br+f7hrBqVyb5AfzKwcAAIBxmG0YfFuSl1bV1qONfa/gS2OYKAAAwGZh2mGiVfWGSU17Jrmmqs5Mcn2S+6S7XvBHSdYueIUAAAAsuJlcM/j0Sc9/3P88cqTtpv73U5O8bAHqAgAAYBHN5Kbze4yjEAAAAMZnttcMAgAAsALM9tYSqapKckCSn0+y1eTlrTWTyAAAACxzswqDVbVTkrOS7JWkJal+URtZTRgEAABY5mY7TPTvktyYZLd0QfCXk+ye5C+TXJGutxAAAIBlbrbDRH8lyYtzx43nq7X230n+tqrulq5X8AkLWB8AAACLYLY9g9snWd9auz3J99LdY3DC55I8eoHqAgAAYBHNNgz+Z5Kd+8eXJvmdkWW/meQ7C1EUAAAAi2u2w0Q/keTgJKck+ZskH6uqdeluQn//JH8+3Q6q6pAk/5BkVZJ3ttaOnbS8+uVPTPLDJM9rrV3UL9s+yTuT7J1u0po/aK19fpavAVgsZ79u7tseeMzC1QEAwLRmFQZba8eMPP5kVR2Q5LfS3WLizNbaJze1fVWtSvK2
JAclWZfkgqo6tbV22chqhybZs//55STH9b+TLiT+a2vtaVV19yTbzKZ+AAAAOrO+z+Co1toFSS6YxSb7J7mytXZVklTVB5IclmQ0DB6W5D2ttZbkvKravqp2TvKDJI9L8rz+2LcmuXU+9QMAAAzVnMJgVR2cLtjtnG5m0S+01s6cwaa7JLlm5Pm63NHrt6l1dklyW5L1Sd5VVb+U5MIkL26t/WCK+o5IckSS3P/+95/JSwIAABiUWU0gU1X3q6ovJPnXJC9M8tj+9+lVdX5V7TLdLqZoazNcZ4skD0tyXGvtoel6Co+e6iCttRNaa/u11vZbvXr1NCUBAAAMz2xnEz0hXW/gY1pr922tPaS1dt90ofC+Sd4+zfbr0t2wfsKuSa6d4TrrkqxrrX2hb/9gunAIAADALM02DP5qkpe31j432tha+/d0vXQHTrP9BUn2rKo9+glgnpXk1EnrnJrk96rzyCQ3ttaua619M8k1VfWgfr1fy52vNQQAAGCGZnvN4LeS/Ggjy36U5Nub2ri1dltVvTDJ6eluLXFia+3Sqnp+v/z4JKelu63EleluLfH7I7v44yTv64PkVZOWAQAAMEOzDYN/m+SvqurC1tq6icaq2jXJq5K8drodtNZOSxf4RtuOH3nckhy1kW0vTrLfLGsGAABgkmnDYFWdMqlphyRfr6qLklyf5D7prt27Psmvp7uuEAAAgGVsJj2Dk6fjvKL/SZJ7Jrk5ycQ1hDsuUF0AAAAsomnDYGttuklhAJIka764Zs7bHnngMQtYCQAA05ntbKJ3UlVbLlQhAAAAjM+sw2BVPbqqPllVNyW5uapuqqrTqupRi1AfAAAAi2BWs4lW1UFJPpHkq0nemO5WEzsleVqSc6rqN1prn1rwKgEAAFhQs721xGvT3RT+6f0tICb8VVV9KN2tJ4RBAACAZW62w0T3SfKOSUFwwgn9cgAAAJa52YbBDUl+biPL/ke/HAAAgGVutmHwn5O8rqoOr6qtkqSqtqqqw9MNIZ18g3oAAACWodleM/jnSXZIclKSk6rq+0m27Ze9v18OAADAMjerMNha+1GS36mqv07yiCQ7J7kuyQWtta8sQn0AAAAsghmHwX5Y6I1Jntla+2gS4Q8AAGAzNeNrBltrNye5Pslti1cOAAAA4zDbCWTenuRFVbXlYhQDAADAeMx2Apntk+yd5OqqOivJt5KM3nOwtdZMIgMAALDMzTYMPjXJLf3jx06xvMWMogAAAMvejMJgVW2d5IlJ/m+Sbyb5VGvtW4tZGAAAAItn2jBYVQ9M8qkku48031hVz2ytnbFYhQEAALB4ZjKBzBuS3J5uWOg2SX4xycXpJpMBAABgMzSTMPioJP+7tfbvrbWbW2uXJ/mjJPevqp0XtzwAAAAWw0zC4M5JrprU9vUkleS+C14RAAAAi26m9xls068CAADA5mKmt5Y4vapum6L9rMntrbX7zL8sAAAAFtNMwuBrFr0KAAAAxmraMNhaEwYBAABWmJleMwgAAMAKIgwCAAAMkDAIAAAwQMIgAADAAAmDAAAAAyQMAgAADJAwCAAAMEDCIAAAwAAJgwAAAAMkDAIAAAyQMAgAADBAwiAAAMAACYMAAAADJAwCAAAMkDAIAAAwQFssdQEwI2e/bqkrAACAFUXPIAAAwAAJgwAAAAMkDAIAAAyQawaB5WG+14UeeMzC1AEAMBB6BgEAAAZIGAQAABggYRAAAGCAhEEAAIABMoEMm4U1X1yz1CUAAMCKomcQAABggIRBAACAARIGAQAABkgYBAAAGCBhEAAAYICEQQAAgAESBgEAAAbIfQaBZWG+95I88sBjFqgSAIBh0DMIAAAwQMIgAADAAAmDAAAAAyQMAgAADJAwCAAAMEDCIAAAwAAJgwAAAAMkDAIAAAyQMAgAADBAwiAAAMAACYMAAAADJAwCAAAMkDAIAAAwQMIgAADAAAmDAAAAAyQMAgAADJAwCAAAMEDCIAAAwAAJgwAA
AAM09jBYVYdU1Ver6sqqOnqK5VVVb+2Xf6mqHjZp+aqq+o+q+vj4qgYAAFhZxhoGq2pVkrclOTTJXkmeXVV7TVrt0CR79j9HJDlu0vIXJ7l8kUsFAABY0cbdM7h/kitba1e11m5N8oEkh01a57Ak72md85JsX1U7J0lV7ZrkN5K8c5xFAwAArDTjDoO7JLlm5Pm6vm2m67wlycuT3L6pg1TVEVW1tqrWrl+/fl4FAwAArETjDoM1RVubyTpV9aQk17fWLpzuIK21E1pr+7XW9lu9evVc6gQAAFjRthjz8dYl2W3k+a5Jrp3hOk9L8uSqemKSrZLcs6re21o7fBHrBTYT+5y0z5y3veS5lyxgJQAAm4dx9wxekGTPqtqjqu6e5FlJTp20zqlJfq+fVfSRSW5srV3XWjumtbZra233frt/EwQBAADmZqw9g62126rqhUlOT7IqyYmttUur6vn98uOTnJbkiUmuTPLDJL8/zhqBzdMLvrthqUsAANisjHuYaFprp6ULfKNtx488bkmOmmYf5yQ5ZxHKAwAAGISx33QeAACApScMAgAADJAwCAAAMEDCIAAAwAAJgwAAAAMkDAIAAAyQMAgAADBAwiAAAMAACYMAAAADJAwCAAAMkDAIAAAwQMIgAADAAAmDAAAAAyQMAgAADJAwCAAAMEDCIAAAwAAJgwAAAAMkDAIAAAyQMAgAADBAwiAAAMAACYMAAAADJAwCAAAMkDAIAAAwQMIgAADAAAmDAAAAA7TFUhcAsNT2OWmfOW97yXMvWcBKAADGR88gAADAAAmDAAAAA2SYKDB4L/juhqUuAQBg7PQMAgAADJAwCAAAMEDCIAAAwAC5ZpDxOPt1S10BAAAwQs8gAADAAAmDAAAAAyQMAgAADJAwCAAAMEDCIAAAwAAJgwAAAAMkDAIAAAyQMAgAADBAwiAAAMAACYMAAAADJAwCAAAMkDAIAAAwQMIgAADAAAmDAAAAAyQMAgAADJAwCAAAMEDCIAAAwAAJgwAAAAO0xVIXALBZO/t1c9/2wGMWrg4AgFkSBgHmYc0X18x52yOFQQBgCRkmCgAAMEDCIAAAwAAJgwAAAAMkDAIAAAyQMAgAADBAwiAAAMAACYMAAAAD5D6DjMV87sUGAAAsPD2DAAAAAyQMAgAADJAwCAAAMEDCIAAAwAAJgwAAAAMkDAIAAAyQMAgAADBAwiAAAMAACYMAAAADJAwCAAAMkDAIAAAwQMIgAADAAAmDAAAAAyQMAgAADJAwCAAAMEBbLHUBAIN19uvmt/2BxyxMHQDAIOkZBAAAGKCxh8GqOqSqvlpVV1bV0VMsr6p6a7/8S1X1sL59t6o6u6our6pLq+rF464dAABgpRhrGKyqVUneluTQJHsleXZV7TVptUOT7Nn/HJHkuL79tiR/1lp7cJJHJjlqim0BAACYgXH3DO6f5MrW2lWttVuTfCDJYZPWOSzJe1rnvCTbV9XOrbXrWmsXJUlr7aYklyfZZZzFAwAArBTjDoO7JLlm5Pm63DXQTbtOVe2e5KFJvjDVQarqiKpaW1Vr169fP9+aAQAAVpxxh8Gaoq3NZp2q2jbJh5K8pLX2vakO0lo7obW2X2ttv9WrV8+5WAAAgJVq3LeWWJdkt5Hnuya5dqbrVNWW6YLg+1prH17EOgEW3ZovrpnX9ke6tQQAMA/j7hm8IMmeVbVHVd09ybOSnDppnVOT/F4/q+gjk9zYWruuqirJPya5vLX25vGWDQAAsLKMtWewtXZbVb0wyelJViU5sbV2aVU9v19+fJLTkjwxyZVJfpjk9/vND0jyu0kuqaqL+7ZXtNZOG+NLAAAAWBHGPUw0fXg7bVLb8SOPW5Kjptjus5n6ekIAAABmaew3nQcAAGDpCYMAAAADJAwCAAAMkDAIAAAwQMIgAADAAAmDAAAAAzT2W0sAsEDOft3ctz3wmIWrAwDYLOkZBAAAGCA9gwCbqTVfXDPnbY/UMwgAg6dnEAAA
YICEQQAAgAESBgEAAAZIGAQAABggYRAAAGCAhEEAAIABEgYBAAAGyH0GmbmzX7fUFQAAAAtEzyAAAMAA6RkEGKL59PQfeMzC1QEALBlhEGCA1nxxzZy3PVIYBIAVwTBRAACAARIGAQAABkgYBAAAGCBhEAAAYICEQQAAgAESBgEAAAZIGAQAABgg9xkEYFb2OWmfOW97yXMvWcBKAID50DMIAAAwQHoGAZiVF3x3w1KXAAAsAD2DAAAAAyQMAgAADJAwCAAAMEDCIAAAwAAJgwAAAAMkDAIAAAyQW0sAMDZuWA8Ay4eeQQAAgAHSMwjA2LhhPQAsH3oGAQAABkgYBAAAGCBhEAAAYIBcMwjAZmE+M5EmZiMFgMmEQQA2CyafAYCFZZgoAADAAOkZZMbWfHHNUpcAAAAsEGEQgGE4+3Vz3/bAYxauDgBYJgwTBQAAGCA9gwAMwnyGuh+pZxCAFUjPIAAAwAAJgwAAAANkmCgATGPNW3ab87ZHvuSaBawEABaOnkEAAIAB0jMIAIvJLS0AWKaEQQBYRPOZxTTz2TaGqAKwaYaJAgAADJAwCAAAMECGiQ7NfK5dAQAAVgxhEABWKpPXALAJwiAArFDzmbzmSGEQYMVzzSAAAMAACYMAAAADJAwCAAAMkGsGAYC7WPOW3ea8rZvdA2we9AwCAAAMkDAIAAAwQMIgAADAALlmEABYWG52D7BZ0DMIAAAwQHoGAYAFteaLa+a87ZF6BgHGRs8gAADAAOkZBACWDfc3BBgfPYMAAAADpGcQAFgZ5jOLaWImU2BwhEEAYEWYz8Q1iclrgOERBgEAEvdHBAZHGAQAiFtiAMMjDAIAzNO8ZkH9pSPnfmAhFJiHsYfBqjokyT8kWZXkna21Yyctr375E5P8MMnzWmsXzWRbAIDNzbyudZzndZJL5bh7bz/nbS957iULVwgM3FjDYFWtSvK2JAclWZfkgqo6tbV22chqhybZs//55STHJfnlGW4LAMAy94LvbpjztvPphZ0v97JkpRl3z+D+Sa5srV2VJFX1gSSHJRkNdIcleU9rrSU5r6q2r6qdk+w+g20HYZ+T9pnztvP5xxcAYMiWMoiyedjc/sNg3GFwlySj79C6dL1/062zywy3TZJU1RFJjuiffr+qvjqPmhfLjkm+Pe6DHjXuA7JUluT8YjCcXywm5xeLyfnFojrqT2q5nmMPmKpx3GGwpmhrM1xnJtt2ja2dkOSE2ZU2XlW1trW231LXwcrk/GIxOb9YTM4vFpPzi8W2uZ1j4w6D65KM9q/vmuTaGa5z9xlsCwAAwAzcbczHuyDJnlW1R1XdPcmzkpw6aZ1Tk/xedR6Z5MbW2nUz3BYAAIAZGGvPYGvttqp6YZLT090e4sTW2qVV9fx++fFJTkt3W4kr091a4vc3te04619gy3oYK5s95xeLyfnFYnJ+sZicXyy2zeocq27STgAAAIZk3MNEAQAAWAaEQQAAgAESBsesqg6pqq9W1ZVVdfRS18Pmr6pOrKrrq+rLI20/W1VnVtUV/e97L2WNbJ6qareqOruqLq+qS6vqxX2784sFUVVbVdX5VfXF/hx7Td/uHGNBVNWqqvqPqvp4/9y5xYKpqqur6pKquriq1vZtm9U5JgyOUVWtSvK2JIcm2SvJs6tqr6WtihXg3UkOmdR2dJKzWmt7Jjmrfw6zdVuSP2utPTjJI5Mc1f+b5fxiodyS5Fdba7+UZN8kh/QziTvHWCgvTnL5yHPnFgvtwNbaviP3FtyszjFhcLz2T3Jla+2q1tqtST6Q5LAlronNXGvtM0m+M6n5sCQn9Y9PSvKUcdbEytBau661dlH/+KZ0f1DtEucXC6R1vt8/3bL/aXGOsQCqatckv5HknSPNzi0W22Z1jgmD47VLkmtGnq/r22Ch7dTfnzP97/sscT1s5qpq9yQPTfKFOL9YQP0wvouTXJ/kzNaac4yF8pYkL09y+0ibc4uF1JKcUVUXVtURfdtmdY6N
9T6DpKZoc28PYFmrqm2TfCjJS1pr36ua6p8ymJvW2k+S7FtV2yf5SFXtvcQlsQJU1ZOSXN9au7CqHr/E5bByHdBau7aq7pPkzKr6ylIXNFt6BsdrXZLdRp7vmuTaJaqFle1bVbVzkvS/r1/iethMVdWW6YLg+1prH+6bnV8suNbahiTnpLsG2jnGfB2Q5MlVdXW6y3J+tareG+cWC6i1dm3/+/okH0l3SdhmdY4Jg+N1QZI9q2qPqrp7kmclOXWJa2JlOjXJc/vHz03ysSWshc1UdV2A/5jk8tbam0cWOb9YEFW1uu8RTFVtneTXk3wlzjHmqbV2TGtt19ba7un+3vq31trhcW6xQKrqHlW13cTjJAcn+XI2s3OsWjNKcZyq6onpxrCvSnJia+21S1sRm7uqen+SxyfZMcm3krwqyUeTnJLk/kn+O8nTW2uTJ5mBTaqqxyQ5N8klueOam1eku27Q+cW8VdVD0k2wsCrdf1Cf0lr7q6raIc4xFkg/TPSlrbUnObdYKFX1wHS9gUl36d3JrbXXbm7nmDAIAAAwQIaJAgAADJAwCAAAMEDCIAAAwAAJgwAAAAMkDAIAAAyQMAjAoqqqV1dVq6rTp1j2wao6Z4y1PL6vZe9xHXM2qurBVXVuVf2gr3P3pa5pKlW1bV/f85a6FgDmThgEYFwOrqpHLHURy9wbk2yf5MlJHpXkuiWtBoAVTRgEYBy+k+RLSf5iqQtZTFW11Tx38QtJzmytndVaO6+1dstC1DUX1Znv6wFgGRMGARiHluRvkzy5qvbZ2Er9kNJvT9HequqFI8+vrqo3VdXRVXVdVd1YVX/XB5gnVtWlVXVTVX20qu49xaHuV1Uf74dj/ndVPX+KYz6mqj5dVT+sqhuq6h1Vtd3I8uf1de1fVedU1Y+SvGwTr23fqjqr3993q+p9VbVTv2z3qmpJfi7Jn/T7PWcj+3nP6JDbqnpQv/6HRtoe3rftOdL2wqq6oqpuqaorq+pPJu331VX17f51X5Dk5iRP75c9taq+VlU/qqrPpAutk+t6clVd2L+n362qL1TVr2zs/QBg6QmDAIzLPyf5Whaud/BZSfZP8vtJ3pDkT5O8OclfJ/nLJM9P8itJXjfFtv+Yrqfyt5N8MslxVfWkiYVVdUCSs5J8M8nTkrwkyROTvGuKfb0/ycf75R+fqtCqWp3knCTbJHlOkj/uazuzqu6ebjjoo/rjndw/PnIjr/szSR5dVav6549LF9weO7LO45J8q7V2RX/8P0zyf5KcmuQ3030Wf1dVR0/a9zZJTkryziSHJDm/qh6W5J+SfDHd+3VqklMmvb6fS/LBJP/W7/93+vfiZzfyGgBYBrZY6gIAGIbW2u1VdWySf6yqV7bWvjbPXd6c5OmttZ8k+deqOixdyNqztfafSVJVv5TkuemC4ahPttZe0T8+vaoemOR/544wd2ySz7XWnjmxQVV9I8lZVbV3a+3LI/t6a2vtH6ap9c/6309orX2v39/XknwhyVNba+9Pcl5V3ZLkutbaeZvY17lJtk3y0CRr04XAk5L8z6r6hdbaV/q2c/vj3C3Jq5O8u7U2UccZVXWvJMdU1Vtaazf37Vsn+dPW2sdGXvcp6UL8M1prLcknq+pnkvzNSE0PTXJTa220Z/S0ad4TAJaYnkEAxum9Sf47yTELsK9z+iA44cokV08EwZG21X3v26iPTHr+4SQPr6pVVbVNup65U6pqi4mfJJ9N8uMkD5+07SdmUOv+Sc6YCIJJ0lo7P8nVSR4zg+1/qrX21STX546ewMel6928aKTtMenDYJJdk9wvXW/gqH9Kcs8ko8N2W7+vybWf2gfBCR+etM4lSe5VVSdV1cFVdY/ZvCYAloYwCMDYtNZuSzek8/CqesA8d7dh0vNbN9JWSSaHweuneL5Fkh2T3DvJqiRr0oW/iZ9bkmyZZLdJ235rBrXuvJH1vpW5DaU8N8ljq2q3JPdPF1Qn2h6cZHXuCIM7b6TOieej
x/9ua+3WSevdN1O/Xz/VB9TDkjwwXY/gt6vq5H54LADLlDAIwLidmC5M/PkUy27OpOC2kQlg5us+Uzy/Lcm30wXKluRVSR4xxc+Jk7Ztmd51UxwzSXZKN9PqbJ2brvfvcUkua63d0Lc9tm/7XrprIieOnSmOv1P/e/T4U72Wb06x7V1eS2vtE621xybZIcn/TPLr6a5TBGCZEgYBGKv+dglvSvIHuaPXasK6JNtV1S4jbQcvQhm/NcXzC1trP2mt/SDJeUke1FpbO8XPtXM43heSPGHSbKSPSLJ7ul692To3Xe/fEekmlJloe0C6CWo+NzKEdl2Sa9PPDDriGelC4yXTHOuCdLPA1kjbb29s5dbaja21k9MNxd1r+pcCwFIxgQwAS+HtSV6R5NFJPj3S/q9JfpTkxKr6uyR75K6TvyyEQ6vqtf2xfzvJQemGOU54ebrJYm5PN0vmTemGY/5Gkr+Yw+Q3b07ygnST1bw+3QQwx6YLYh/a1IYbcXG6IPe4JMclSWvtO1V1Wd/20xlb+4l7Xp3k7VV1Q5Iz081k+oIkrxiZPGZjXp8uzJ5SVf+YZO90PX8/VVV/lO46y39NFzz3TBc+3zOH1wbAmOgZBGDsWms/TPL3U7R/O8lT00168tEkh6fr6Vpo/yvJw/pjPCnJUa21U0fq+Gy6ULU6yf9L8i/pAuI1mdk1gnfSWluf5MB0w2Dfn+Rt6XryDpriGr2Z7O/2JJ/rn35mZNHEdYKfnbT+O5K8KF0P6MeTPDvJn7XWjp3Bsdamu43HQ9O9X09J8sxJq30p3Xv15iRnpJuZ9R2ZeigwAMtE3XlyMAAAAIZAzyAAAMAACYMAAAADJAwCAAAMkDAIAAAwQMIgAADAAAmDAAAAAyQMAgAADJAwCAAAMED/H28/4hcEbqSSAAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<Figure size 1080x720 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "dist_train = train_qs.apply(lambda x: len(x.split(' ')))\n",
    "dist_test = test_qs.apply(lambda x: len(x.split(' ')))\n",
    "\n",
    "plt.figure(figsize=(15, 10))\n",
    "plt.hist(dist_train, bins=50, range=[0, 50], color=pal[2], density=True, label='train')\n",
    "plt.hist(dist_test, bins=50, range=[0, 50], color=pal[1], density=True, alpha=0.5, label='test')\n",
    "plt.title('Normalised histogram of word count in questions', fontsize=15)\n",
    "plt.legend()\n",
    "plt.xlabel('Number of words', fontsize=15)\n",
    "plt.ylabel('Probability', fontsize=15)\n",
    "\n",
    "print('mean-train {:.2f} std-train {:.2f} mean-test {:.2f} std-test {:.2f} max-train {:.2f} max-test {:.2f}'.format(dist_train.mean(), \n",
    "                          dist_train.std(), dist_test.mean(), dist_test.std(), dist_train.max(), dist_test.max()))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们在字数上看到了类似的分布，大多数问题的长度都在10个字左右。在我看来，训练集的分布似乎更“尖锐”，而在测试集上则更为广泛。然而，它们非常相似。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.3 语义分析\n",
    "接下来，我将看一看不同标点符号在问句中的用法——这可能会为以后的一些有趣的特性奠定基础。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T11:53:32.713837Z",
     "iopub.status.busy": "2021-07-10T11:53:32.713558Z",
     "iopub.status.idle": "2021-07-10T11:53:43.071057Z",
     "shell.execute_reply": "2021-07-10T11:53:43.070085Z",
     "shell.execute_reply.started": "2021-07-10T11:53:32.713809Z"
    }
   },
   "outputs": [],
   "source": [
    "qmarks = np.mean(train_qs.apply(lambda x: '?' in x))\n",
    "math = np.mean(train_qs.apply(lambda x: '[math]' in x))\n",
    "fullstop = np.mean(train_qs.apply(lambda x: '.' in x))\n",
    "capital_first = np.mean(train_qs.apply(lambda x: x[0].isupper()))\n",
    "capitals = np.mean(train_qs.apply(lambda x: max([y.isupper() for y in x])))\n",
    "numbers = np.mean(train_qs.apply(lambda x: max([y.isdigit() for y in x])))\n",
    "\n",
    "print('Questions with question marks: {:.2f}%'.format(qmarks * 100))\n",
    "print('Questions with [math] tags: {:.2f}%'.format(math * 100))\n",
    "print('Questions with full stops: {:.2f}%'.format(fullstop * 100))\n",
    "print('Questions with capitalised first letters: {:.2f}%'.format(capital_first * 100))\n",
    "print('Questions with capital letters: {:.2f}%'.format(capitals * 100))\n",
    "print('Questions with numbers: {:.2f}%'.format(numbers * 100))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5 特征工程\n",
    "特征工程上重点是从question1和question2上提取一些有用的信息变成新的变量加入到数据集中。在这里我借鉴了crowdflower竞赛中1st选手的思路同时也结合本题目中的一些特点，我将引入特征分类如下：\n",
    "1. ngram计数类的：\n",
    " 'count_of_question1_unigram',\n",
    " 'count_of_unique_question1_unigram',\n",
    " 'count_of_question1_bigram',\n",
    " 'count_of_unique_question1_bigram',\n",
    " 'count_of_question1_trigram',\n",
    " 'count_of_unique_question1_trigram',\n",
    " 'count_of_question2_unigram',\n",
    " 'count_of_unique_question2_unigram',\n",
    " 'count_of_question2_bigram',\n",
    " 'count_of_unique_question2_bigram',\n",
    " 'count_of_question2_trigram',\n",
    " 'count_of_unique_question2_trigram',\n",
    " 'count_of_question1_unigram_in_question2',\n",
    " 'count_of_question2_unigram_in_question1',\n",
    " 'count_of_question1_bigram_in_question2',\n",
    " 'count_of_question2_bigram_in_question1',\n",
    " 'count_of_question1_trigram_in_question2',\n",
    " 'count_of_question2_trigram_in_question1',\n",
    "2. 距离类的：\n",
    " 'jaccard_coef_of_unigram_between_question1_question2',\n",
    " 'jaccard_coef_of_bigram_between_question1_question2',\n",
    " 'jaccard_coef_of_trigram_between_question1_question2',\n",
    " 'dice_dist_of_unigram_between_question1_question2',\n",
    " 'dice_dist_of_bigram_between_question1_question2',\n",
    " 'dice_dist_of_trigram_between_question1_question2',\n",
    "3. TFIDF类的：\n",
    " 'tfidf_word_match',\n",
    "4. 相关性类的：\n",
    " 'word_match_share'\n",
    " 'word_count_diff'\n",
    " \n",
    "### 5.1 ngram计数类的特征提取"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T11:53:43.072741Z",
     "iopub.status.busy": "2021-07-10T11:53:43.072412Z",
     "iopub.status.idle": "2021-07-10T11:53:44.068936Z",
     "shell.execute_reply": "2021-07-10T11:53:44.067894Z",
     "shell.execute_reply.started": "2021-07-10T11:53:43.072713Z"
    }
   },
   "outputs": [],
   "source": [
    "import re\n",
    "import sys\n",
    "import nltk\n",
    "import numpy as np\n",
    "from bs4 import BeautifulSoup\n",
    "#from replacer import CsvWordReplacer\n",
    "from nltk import pos_tag\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer\n",
    "#sys.path.append(\"../\")\n",
    "#from param_config import config\n",
    "\n",
    "##########################\n",
    "## Synonym Replacement ##\n",
    "##########################\n",
    "class WordReplacer(object):\n",
    "    def __init__(self, word_map):\n",
    "        self.word_map = word_map\n",
    "    def replace(self, word):\n",
    "        return [self.word_map.get(w, w) for w in word]\n",
    "    \n",
    "    \n",
    "class CsvWordReplacer(WordReplacer):\n",
    "    def __init__(self, fname):\n",
    "        word_map = {}\n",
    "        for line in csv.reader(open(fname)):\n",
    "            word, syn = line\n",
    "            if word.startswith(\"#\"):\n",
    "                continue\n",
    "            word_map[word] = syn\n",
    "        super(CsvWordReplacer, self).__init__(word_map)\n",
    "################\n",
    "## Stop Words ##\n",
    "################\n",
    "stopwords = nltk.corpus.stopwords.words(\"english\")\n",
    "stopwords = set(stopwords)\n",
    "\n",
    "\n",
    "##############\n",
    "## Stemming ##\n",
    "##############\n",
    "#if config.stemmer_type == \"porter\":\n",
    "#    english_stemmer = nltk.stem.PorterStemmer()\n",
    "#elif config.stemmer_type == \"snowball\":\n",
    "english_stemmer = nltk.stem.SnowballStemmer('english')\n",
    "\n",
    "def stem_tokens(tokens, stemmer):\n",
    "    stemmed = []\n",
    "    for token in tokens:\n",
    "        stemmed.append(stemmer.stem(token))\n",
    "    return stemmed\n",
    "\n",
    "def try_divide(x, y, val=0.0):\n",
    "    \"\"\" \n",
    "    \tTry to divide two numbers\n",
    "    \"\"\"\n",
    "    if y != 0.0:\n",
    "    \tval = float(x) / y\n",
    "    return val\n",
    "\n",
    "def dump_feat_name(feat_names, feat_name_file):\n",
    "\t\"\"\"\n",
    "\t\tsave feat_names to feat_name_file\n",
    "\t\"\"\"\n",
    "\twith open(feat_name_file, \"wb\") as f:\n",
    "\t    for i,feat_name in enumerate(feat_names):\n",
    "\t        if feat_name.startswith(\"count\") or feat_name.startswith(\"pos_of\"):\n",
    "\t            f.write(\"('%s', SimpleTransform(config.count_feat_transform)),\\n\" % feat_name)\n",
    "\t        else:\n",
    "\t            f.write(\"('%s', SimpleTransform()),\\n\" % feat_name)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T11:53:44.070588Z",
     "iopub.status.busy": "2021-07-10T11:53:44.07027Z",
     "iopub.status.idle": "2021-07-10T11:53:44.077133Z",
     "shell.execute_reply": "2021-07-10T11:53:44.076023Z",
     "shell.execute_reply.started": "2021-07-10T11:53:44.070557Z"
    }
   },
   "outputs": [],
   "source": [
    "def get_sample_indices_by_relevance(dfTrain, additional_key=None):\n",
    "\t\"\"\" \n",
    "\t\treturn a dict with\n",
    "\t\tkey: (additional_key, median_relevance)\n",
    "\t\tval: list of sample indices\n",
    "\t\"\"\"\n",
    "\tdfTrain[\"sample_index\"] = range(dfTrain.shape[0])\n",
    "\tgroup_key = [\"median_relevance\"]\n",
    "\tif additional_key != None:\n",
    "\t\tgroup_key.insert(0, additional_key)\n",
    "\tagg = dfTrain.groupby(group_key, as_index=False).apply(lambda x: list(x[\"sample_index\"]))\n",
    "\td = dict(agg)\n",
    "\tdfTrain = dfTrain.drop(\"sample_index\", axis=1)\n",
    "\treturn d"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T11:53:44.078705Z",
     "iopub.status.busy": "2021-07-10T11:53:44.078388Z",
     "iopub.status.idle": "2021-07-10T11:53:44.097952Z",
     "shell.execute_reply": "2021-07-10T11:53:44.097172Z",
     "shell.execute_reply.started": "2021-07-10T11:53:44.078675Z"
    }
   },
   "outputs": [],
   "source": [
    "stats_feat_flag = False\n",
    "\n",
    "\n",
    "#####################\n",
    "## Distance metric ##\n",
    "#####################\n",
    "def JaccardCoef(A, B):\n",
    "    A, B = set(A), set(B)\n",
    "    intersect = len(A.intersection(B))\n",
    "    union = len(A.union(B))\n",
    "    coef = try_divide(intersect, union)\n",
    "    return coef\n",
    "\n",
    "def DiceDist(A, B):\n",
    "    A, B = set(A), set(B)\n",
    "    intersect = len(A.intersection(B))\n",
    "    union = len(A) + len(B)\n",
    "    d = try_divide(2*intersect, union)\n",
    "    return d\n",
    "\n",
    "def compute_dist(A, B, dist=\"jaccard_coef\"):\n",
    "    if dist == \"jaccard_coef\":\n",
    "        d = JaccardCoef(A, B)\n",
    "    elif dist == \"dice_dist\":\n",
    "        d = DiceDist(A, B)\n",
    "    return d\n",
    "\n",
    "#### pairwise distance\n",
    "def pairwise_jaccard_coef(A, B):\n",
    "    coef = np.zeros((A.shape[0], B.shape[0]), dtype=float)\n",
    "    for i in range(A.shape[0]):\n",
    "        for j in range(B.shape[0]):\n",
    "            coef[i,j] = JaccardCoef(A[i], B[j])\n",
    "    return coef\n",
    "    \n",
    "def pairwise_dice_dist(A, B):\n",
    "    d = np.zeros((A.shape[0], B.shape[0]), dtype=float)\n",
    "    for i in range(A.shape[0]):\n",
    "        for j in range(B.shape[0]):\n",
    "            d[i,j] = DiceDist(A[i], B[j])\n",
    "    return d\n",
    "\n",
    "def pairwise_dist(A, B, dist=\"jaccard_coef\"):\n",
    "    if dist == \"jaccard_coef\":\n",
    "        d = pairwise_jaccard_coef(A, B)\n",
    "    elif dist == \"dice_dist\":\n",
    "        d = pairwise_dice_dist(A, B)\n",
    "    return d\n"
   ]
  },
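  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check on the two set-overlap measures defined above, here is a minimal, self-contained sketch. The helper names `jaccard` and `dice` and the two token lists are illustrative assumptions, not part of the original pipeline.\n",
    "\n",
    "```python\n",
    "def jaccard(a, b):\n",
    "    # |A & B| / |A | B|; returns 0.0 when both sets are empty\n",
    "    a, b = set(a), set(b)\n",
    "    union = len(a | b)\n",
    "    return len(a & b) / union if union else 0.0\n",
    "\n",
    "\n",
    "def dice(a, b):\n",
    "    # 2 * |A & B| / (|A| + |B|); returns 0.0 when both sets are empty\n",
    "    a, b = set(a), set(b)\n",
    "    total = len(a) + len(b)\n",
    "    return 2 * len(a & b) / total if total else 0.0\n",
    "\n",
    "\n",
    "q1 = ['how', 'do', 'i', 'learn', 'python']\n",
    "q2 = ['how', 'can', 'i', 'learn', 'python', 'fast']\n",
    "print(jaccard(q1, q2))  # 4 shared tokens out of 7 distinct tokens\n",
    "print(dice(q1, q2))     # 2 * 4 shared tokens over 5 + 6 tokens\n",
    "```\n",
    "\n",
    "Both values lie in [0, 1], and on the same pair the Dice value is never smaller than the Jaccard value, so the two features are strongly correlated."
   ]
  },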
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T11:53:44.099891Z",
     "iopub.status.busy": "2021-07-10T11:53:44.099473Z",
     "iopub.status.idle": "2021-07-10T11:53:44.114061Z",
     "shell.execute_reply": "2021-07-10T11:53:44.112875Z",
     "shell.execute_reply.started": "2021-07-10T11:53:44.099849Z"
    }
   },
   "outputs": [],
   "source": [
    "def getUnigram(words):\n",
    "    \"\"\"\n",
    "        Input: a list of words, e.g., ['I', 'am', 'Denny']\n",
    "        Output: a list of unigram\n",
    "    \"\"\"\n",
    "    assert type(words) == list\n",
    "    return words\n",
    "    \n",
    "def getBigram(words, join_string, skip=0):\n",
    "\t\"\"\"\n",
    "\t   Input: a list of words, e.g., ['I', 'am', 'Denny']\n",
    "\t   Output: a list of bigram, e.g., ['I_am', 'am_Denny']\n",
    "\t   I use _ as join_string for this example.\n",
    "\t\"\"\"\n",
    "\tassert type(words) == list\n",
    "\tL = len(words)\n",
    "\tif L > 1:\n",
    "\t\tlst = []\n",
    "\t\tfor i in range(L-1):\n",
    "\t\t\tfor k in range(1,skip+2):\n",
    "\t\t\t\tif i+k < L:\n",
    "\t\t\t\t\tlst.append( join_string.join([words[i], words[i+k]]) )\n",
    "\telse:\n",
    "\t\t# set it as unigram\n",
    "\t\tlst = getUnigram(words)\n",
    "\treturn lst\n",
    "    \n",
    "def getTrigram(words, join_string, skip=0):\n",
    "\t\"\"\"\n",
    "\t   Input: a list of words, e.g., ['I', 'am', 'Denny']\n",
    "\t   Output: a list of trigram, e.g., ['I_am_Denny']\n",
    "\t   I use _ as join_string for this example.\n",
    "\t\"\"\"\n",
    "\tassert type(words) == list\n",
    "\tL = len(words)\n",
    "\tif L > 2:\n",
    "\t\tlst = []\n",
    "\t\tfor i in range(L-2):\n",
    "\t\t\tfor k1 in range(1,skip+2):\n",
    "\t\t\t\tfor k2 in range(1,skip+2):\n",
    "\t\t\t\t\tif i+k1 < L and i+k1+k2 < L:\n",
    "\t\t\t\t\t\tlst.append( join_string.join([words[i], words[i+k1], words[i+k1+k2]]) )\n",
    "\telse:\n",
    "\t\t# set it as bigram\n",
    "\t\tlst = getBigram(words, join_string, skip)\n",
    "\treturn lst"
   ]
  },
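  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the n-gram helpers above concrete, the following sketch reproduces their behaviour for the default `skip=0` case only. The function name `ngrams` is a hypothetical stand-in, and the fallback to shorter grams for short inputs is my reading of `getBigram`/`getTrigram`, not a separate specification.\n",
    "\n",
    "```python\n",
    "def ngrams(words, n, sep='_'):\n",
    "    # Contiguous n-grams joined by sep. For inputs shorter than n, fall back\n",
    "    # to the longest gram that fits, as the helpers above do when skip=0.\n",
    "    n = min(n, len(words))\n",
    "    if n == 0:\n",
    "        return []\n",
    "    return [sep.join(words[i:i + n]) for i in range(len(words) - n + 1)]\n",
    "\n",
    "\n",
    "words = ['i', 'am', 'denni']\n",
    "print(ngrams(words, 1))      # ['i', 'am', 'denni']\n",
    "print(ngrams(words, 2))      # ['i_am', 'am_denni']\n",
    "print(ngrams(words, 3))      # ['i_am_denni']\n",
    "print(ngrams(['hello'], 3))  # ['hello']\n",
    "```\n",
    "\n",
    "Because of the fallback, a one-word question contributes the same token to its unigram, bigram and trigram lists, which is worth keeping in mind when reading the count features."
   ]
  },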
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T11:53:44.115351Z",
     "iopub.status.busy": "2021-07-10T11:53:44.115073Z",
     "iopub.status.idle": "2021-07-10T11:53:44.131587Z",
     "shell.execute_reply": "2021-07-10T11:53:44.130572Z",
     "shell.execute_reply.started": "2021-07-10T11:53:44.115325Z"
    }
   },
   "outputs": [],
   "source": [
    "def get_position_list(target, obs):\n",
    "    \"\"\"\n",
    "        Get the list of positions of obs in target\n",
    "    \"\"\"\n",
    "    pos_of_obs_in_target = [0]\n",
    "    if len(obs) != 0:\n",
    "        pos_of_obs_in_target = [j for j,w in enumerate(obs, start=1) if w in target]\n",
    "        if len(pos_of_obs_in_target) == 0:\n",
    "            pos_of_obs_in_target = [0]\n",
    "    return pos_of_obs_in_target\n",
    "\n",
    "######################\n",
    "## Pre-process data ##\n",
    "######################\n",
    "token_pattern = r\"(?u)\\b\\w\\w+\\b\"\n",
    "#token_pattern = r'\\w{1,}'\n",
    "#token_pattern = r\"\\w+\"\n",
    "#token_pattern = r\"[\\w']+\"\n",
    "def preprocess_data(line,\n",
    "                    token_pattern=token_pattern,\n",
    "                    exclude_stopword=False, #config.cooccurrence_word_exclude_stopword,\n",
    "                    encode_digit=False):\n",
    "    token_pattern = re.compile(token_pattern)#, flags = re.UNICODE | re.LOCALE)\n",
    "    ## tokenize\n",
    "    tokens = [x.lower() for x in token_pattern.findall(line)]\n",
    "    ## stem\n",
    "    tokens_stemmed = stem_tokens(tokens, english_stemmer)\n",
    "    if exclude_stopword:\n",
    "        tokens_stemmed = [x for x in tokens_stemmed if x not in stopwords]\n",
    "    return tokens_stemmed"
   ]
  },
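  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The tokenization step in `preprocess_data` can be tried in isolation. The sketch below reuses the same regular expression but leaves out stemming and stop-word removal so that it runs without NLTK; the name `tokenize` and the sample sentence are illustrative assumptions.\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "# Same pattern as token_pattern above: runs of two or more word characters\n",
    "TOKEN_PATTERN = re.compile(r'(?u)\\b\\w\\w+\\b')\n",
    "\n",
    "\n",
    "def tokenize(line):\n",
    "    # Lower-case every match; punctuation never matches the pattern\n",
    "    return [t.lower() for t in TOKEN_PATTERN.findall(line)]\n",
    "\n",
    "\n",
    "print(tokenize('What is AI? Is it a fad?'))\n",
    "# ['what', 'is', 'ai', 'is', 'it', 'fad']\n",
    "```\n",
    "\n",
    "Single-character tokens such as `a` or `I` are dropped by this pattern, so they never appear in any of the n-gram features."
   ]
  },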
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T11:53:44.133198Z",
     "iopub.status.busy": "2021-07-10T11:53:44.132846Z",
     "iopub.status.idle": "2021-07-10T11:53:44.14932Z",
     "shell.execute_reply": "2021-07-10T11:53:44.148153Z",
     "shell.execute_reply.started": "2021-07-10T11:53:44.133168Z"
    }
   },
   "outputs": [],
   "source": [
    "def extract_feat(df):\n",
    "    extract_feat_ngram(df)\n",
    "    extract_feat_word_count(df)\n",
    "    extract_feat_intersect_word_count(df)\n",
    "    #extract_feat_intersect_word_pos(df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T11:53:44.151123Z",
     "iopub.status.busy": "2021-07-10T11:53:44.150755Z",
     "iopub.status.idle": "2021-07-10T11:53:44.163349Z",
     "shell.execute_reply": "2021-07-10T11:53:44.162277Z",
     "shell.execute_reply.started": "2021-07-10T11:53:44.151092Z"
    }
   },
   "outputs": [],
   "source": [
    " def extract_feat_ngram(df):\n",
    "    ## unigram\n",
    "    print(\"generate unigram\")\n",
    "    df[\"question1_unigram\"] = list(df.apply(lambda x: preprocess_data(x[\"question1\"]), axis=1))\n",
    "    df[\"question2_unigram\"] = list(df.apply(lambda x: preprocess_data(x[\"question2\"]), axis=1))\n",
    "    ## bigram\n",
    "    print(\"generate bigram\")\n",
    "    join_str = \"_\"\n",
    "    df[\"question1_bigram\"] = list(df.apply(lambda x: getBigram(x[\"question1_unigram\"], join_str), axis=1))\n",
    "    df[\"question2_bigram\"] = list(df.apply(lambda x: getBigram(x[\"question2_unigram\"], join_str), axis=1))\n",
    "    ## trigram\n",
    "    print(\"generate trigram\")\n",
    "    join_str = \"_\"\n",
    "    df[\"question1_trigram\"] = list(df.apply(lambda x: getTrigram(x[\"question1_unigram\"], join_str), axis=1))\n",
    "    df[\"question2_trigram\"] = list(df.apply(lambda x: getTrigram(x[\"question2_unigram\"], join_str), axis=1))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T11:53:44.165035Z",
     "iopub.status.busy": "2021-07-10T11:53:44.164692Z",
     "iopub.status.idle": "2021-07-10T11:53:44.180513Z",
     "shell.execute_reply": "2021-07-10T11:53:44.179488Z",
     "shell.execute_reply.started": "2021-07-10T11:53:44.164996Z"
    }
   },
   "outputs": [],
   "source": [
    "def extract_feat_word_count(df):\n",
    "    ################################\n",
    "    ## word count and digit count ##\n",
    "    ################################\n",
    "    print(\"generate word counting features\")\n",
    "    feat_names = [\"question1\", \"question2\"]\n",
    "    grams = [\"unigram\", \"bigram\", \"trigram\"]\n",
    "    #count_digit = lambda x: sum([1. for w in x if w.isdigit()])\n",
    "    for feat_name in feat_names:\n",
    "        for gram in grams:\n",
    "            ## word count\n",
    "            df[\"count_of_%s_%s\"%(feat_name,gram)] = list(df.apply(lambda x: len(x[feat_name+\"_\"+gram]), axis=1))\n",
    "            df[\"count_of_unique_%s_%s\"%(feat_name,gram)] = list(df.apply(lambda x: len(set(x[feat_name+\"_\"+gram])), axis=1))\n",
    "            #df[\"ratio_of_unique_%s_%s\"%(feat_name,gram)] = map(try_divide, df[\"count_of_unique_%s_%s\"%(feat_name,gram)], df[\"count_of_%s_%s\"%(feat_name,gram)])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T11:53:44.182175Z",
     "iopub.status.busy": "2021-07-10T11:53:44.181867Z",
     "iopub.status.idle": "2021-07-10T11:53:44.194272Z",
     "shell.execute_reply": "2021-07-10T11:53:44.193203Z",
     "shell.execute_reply.started": "2021-07-10T11:53:44.182146Z"
    }
   },
   "outputs": [],
   "source": [
    "def extract_feat_intersect_word_count(df):\n",
    "    ##############################\n",
    "    ## intersect word count ##\n",
    "    ##############################\n",
    "    print(\"generate intersect word counting features\")\n",
    "    feat_names = [\"question1\", \"question2\"]\n",
    "    grams = [\"unigram\", \"bigram\", \"trigram\"]\n",
    "    #### unigram\n",
    "    for gram in grams:\n",
    "        for obs_name in feat_names:\n",
    "            for target_name in feat_names:\n",
    "                if target_name != obs_name:\n",
    "                    ## query\n",
    "                    df[\"count_of_%s_%s_in_%s\"%(obs_name,gram,target_name)] = list(df.apply(lambda x: sum([1. for w in x[obs_name+\"_\"+gram] if w in set(x[target_name+\"_\"+gram])]), axis=1))\n",
    "                    #df[\"ratio_of_%s_%s_in_%s\"%(obs_name,gram,target_name)] = map(try_divide, df[\"count_of_%s_%s_in_%s\"%(obs_name,gram,target_name)], df[\"count_of_%s_%s\"%(obs_name,gram)])\n",
    "\n",
    "        ## some other feat\n",
    "        #df[\"title_%s_in_query_div_query_%s\"%(gram,gram)] = map(try_divide, df[\"count_of_title_%s_in_query\"%gram], df[\"count_of_query_%s\"%gram])\n",
    "        #df[\"title_%s_in_query_div_query_%s_in_title\"%(gram,gram)] = map(try_divide, df[\"count_of_title_%s_in_query\"%gram], df[\"count_of_query_%s_in_title\"%gram])\n",
    "        #df[\"description_%s_in_query_div_query_%s\"%(gram,gram)] = map(try_divide, df[\"count_of_description_%s_in_query\"%gram], df[\"count_of_query_%s\"%gram])\n",
    "        #df[\"description_%s_in_query_div_query_%s_in_description\"%(gram,gram)] = map(try_divide, df[\"count_of_description_%s_in_query\"%gram], df[\"count_of_query_%s_in_description\"%gram])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T11:53:44.19609Z",
     "iopub.status.busy": "2021-07-10T11:53:44.195787Z",
     "iopub.status.idle": "2021-07-10T11:53:44.211864Z",
     "shell.execute_reply": "2021-07-10T11:53:44.211015Z",
     "shell.execute_reply.started": "2021-07-10T11:53:44.196056Z"
    }
   },
   "outputs": [],
   "source": [
    "def extract_feat_intersect_word_pos(df):\n",
    "    ######################################\n",
    "    ## intersect word position feat ##\n",
    "    ######################################\n",
    "    print(\"generate intersect word position features\")\n",
    "    feat_names = [\"question1\", \"question2\"]\n",
    "    grams = [\"unigram\", \"bigram\", \"trigram\"]\n",
    "    for gram in grams:\n",
    "        for target_name in feat_names:\n",
    "            for obs_name in feat_names:\n",
    "                if target_name != obs_name:\n",
    "                    pos = list(df.apply(lambda x: get_position_list(x[target_name+\"_\"+gram], obs=x[obs_name+\"_\"+gram]), axis=1))\n",
    "                    ## stats feat on pos\n",
    "                    ## (map() is lazy in Python 3, so it is materialised with list() before being assigned to a column)\n",
    "                    df[\"pos_of_%s_%s_in_%s_min\" % (obs_name, gram, target_name)] = list(map(np.min, pos))\n",
    "                    df[\"pos_of_%s_%s_in_%s_mean\" % (obs_name, gram, target_name)] = list(map(np.mean, pos))\n",
    "                    df[\"pos_of_%s_%s_in_%s_median\" % (obs_name, gram, target_name)] = list(map(np.median, pos))\n",
    "                    df[\"pos_of_%s_%s_in_%s_max\" % (obs_name, gram, target_name)] = list(map(np.max, pos))\n",
    "                    df[\"pos_of_%s_%s_in_%s_std\" % (obs_name, gram, target_name)] = list(map(np.std, pos))\n",
    "                    ## stats feat on normalized_pos\n",
    "                    df[\"normalized_pos_of_%s_%s_in_%s_min\" % (obs_name, gram, target_name)] = list(map(try_divide, df[\"pos_of_%s_%s_in_%s_min\" % (obs_name, gram, target_name)], df[\"count_of_%s_%s\" % (obs_name, gram)]))\n",
    "                    df[\"normalized_pos_of_%s_%s_in_%s_mean\" % (obs_name, gram, target_name)] = list(map(try_divide, df[\"pos_of_%s_%s_in_%s_mean\" % (obs_name, gram, target_name)], df[\"count_of_%s_%s\" % (obs_name, gram)]))\n",
    "                    df[\"normalized_pos_of_%s_%s_in_%s_median\" % (obs_name, gram, target_name)] = list(map(try_divide, df[\"pos_of_%s_%s_in_%s_median\" % (obs_name, gram, target_name)], df[\"count_of_%s_%s\" % (obs_name, gram)]))\n",
    "                    df[\"normalized_pos_of_%s_%s_in_%s_max\" % (obs_name, gram, target_name)] = list(map(try_divide, df[\"pos_of_%s_%s_in_%s_max\" % (obs_name, gram, target_name)], df[\"count_of_%s_%s\" % (obs_name, gram)]))\n",
    "                    df[\"normalized_pos_of_%s_%s_in_%s_std\" % (obs_name, gram, target_name)] = list(map(try_divide, df[\"pos_of_%s_%s_in_%s_std\" % (obs_name, gram, target_name)], df[\"count_of_%s_%s\" % (obs_name, gram)]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T11:53:44.213699Z",
     "iopub.status.busy": "2021-07-10T11:53:44.213284Z",
     "iopub.status.idle": "2021-07-10T12:01:05.562689Z",
     "shell.execute_reply": "2021-07-10T12:01:05.561586Z",
     "shell.execute_reply.started": "2021-07-10T11:53:44.213668Z"
    }
   },
   "outputs": [],
   "source": [
    "extract_feat(df_train)\n",
    "#extract_feat_ngram(df_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T12:01:05.564414Z",
     "iopub.status.busy": "2021-07-10T12:01:05.564113Z",
     "iopub.status.idle": "2021-07-10T12:01:05.568938Z",
     "shell.execute_reply": "2021-07-10T12:01:05.567675Z",
     "shell.execute_reply.started": "2021-07-10T12:01:05.564386Z"
    }
   },
   "source": [
    "### 5.2 Distance feature extraction"
   ]
  },
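  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `compute_dist` helper called in the next cell is defined elsewhere in this notebook. As a rough illustration of the two measures it is asked for, here is a minimal sketch over token lists. The function names `jaccard_coef` and `dice_dist`, the toy questions, and the convention of returning 0.0 when both inputs are empty are assumptions made for this example and may differ from what the notebook's helper actually does.\n",
    "\n",
    "```python\n",
    "def jaccard_coef(a, b):\n",
    "    # size of the intersection divided by size of the union of the two token sets\n",
    "    a, b = set(a), set(b)\n",
    "    union = len(a | b)\n",
    "    return len(a & b) / union if union else 0.0  # 0.0 on empty input is an assumed convention\n",
    "\n",
    "def dice_dist(a, b):\n",
    "    # twice the intersection size divided by the sum of the two set sizes\n",
    "    a, b = set(a), set(b)\n",
    "    total = len(a) + len(b)\n",
    "    return 2 * len(a & b) / total if total else 0.0  # same assumed convention\n",
    "\n",
    "q1 = 'how do i learn python'.split()\n",
    "q2 = 'how can i learn python fast'.split()\n",
    "print(jaccard_coef(q1, q2), dice_dist(q1, q2))\n",
    "```\n",
    "\n",
    "On this toy pair the questions share 4 of 7 distinct tokens, so the Jaccard coefficient is 4/7 and the Dice value is 8/11. Dice counts the shared tokens twice, so it is never smaller than Jaccard for the same pair."
   ]
  },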
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T12:01:05.570913Z",
     "iopub.status.busy": "2021-07-10T12:01:05.570569Z",
     "iopub.status.idle": "2021-07-10T12:01:05.58424Z",
     "shell.execute_reply": "2021-07-10T12:01:05.583368Z",
     "shell.execute_reply.started": "2021-07-10T12:01:05.570882Z"
    }
   },
   "outputs": [],
   "source": [
    "#####################################\n",
    "## Extract basic distance features ##\n",
    "#####################################\n",
    "def extract_basic_distance_feat(df):\n",
    "    ## jaccard coef/dice dist of n-gram\n",
    "    print (\"generate jaccard coef and dice dist for n-gram\")\n",
    "    dists = [\"jaccard_coef\", \"dice_dist\"]\n",
    "    grams = [\"unigram\", \"bigram\", \"trigram\"]\n",
    "    feat_names = [\"question1\", \"question2\"]\n",
    "    for dist in dists:\n",
    "        for gram in grams:\n",
    "            for i in range(len(feat_names)-1):\n",
    "                for j in range(i+1,len(feat_names)):\n",
    "                    target_name = feat_names[i]\n",
    "                    obs_name = feat_names[j]\n",
    "                    df[\"%s_of_%s_between_%s_%s\"%(dist,gram,target_name,obs_name)] = \\\n",
    "                            list(df.apply(lambda x: compute_dist(x[target_name+\"_\"+gram], x[obs_name+\"_\"+gram], dist), axis=1))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T12:01:05.586165Z",
     "iopub.status.busy": "2021-07-10T12:01:05.58575Z",
     "iopub.status.idle": "2021-07-10T12:01:05.610656Z",
     "shell.execute_reply": "2021-07-10T12:01:05.60962Z",
     "shell.execute_reply.started": "2021-07-10T12:01:05.586122Z"
    }
   },
   "outputs": [],
   "source": [
    "###########################################\n",
    "## Extract statistical distance features ##\n",
    "###########################################\n",
    "## generate dist stats feat\n",
    "def generate_dist_stats_feat(dist, X_train, ids_train, X_test, ids_test, indices_dict, qids_test=None):\n",
    "\n",
    "    stats_feat = 0 * np.ones((len(ids_test), stats_feat_num*config.n_classes), dtype=float)\n",
    "    ## pairwise dist\n",
    "    distance = pairwise_dist(X_test, X_train, dist)\n",
    "    for i in range(len(ids_test)):\n",
    "        id = ids_test[i]\n",
    "        if qids_test is not None:\n",
    "            qid = qids_test[i]\n",
    "        for j in range(config.n_classes):\n",
    "            key = (qid, j+1) if qids_test is not None else j+1\n",
    "            # dict.has_key() was removed in Python 3; use the `in` operator instead\n",
    "            if key in indices_dict:\n",
    "                inds = indices_dict[key]\n",
    "                # exclude this sample itself from the list of indices\n",
    "                inds = [ ind for ind in inds if id != ids_train[ind] ]\n",
    "                distance_tmp = distance[i][inds]\n",
    "                if len(distance_tmp) != 0:\n",
    "                    feat = [ func(distance_tmp) for func in stats_func ]\n",
    "                    ## quantile\n",
    "                    distance_tmp = pd.Series(distance_tmp)\n",
    "                    quantiles = distance_tmp.quantile(quantiles_range)\n",
    "                    feat = np.hstack((feat, quantiles))\n",
    "                    stats_feat[i,j*stats_feat_num:(j+1)*stats_feat_num] = feat\n",
    "    return stats_feat\n",
    "\n",
    "\n",
    "def extract_statistical_distance_feat(path, dfTrain, dfTest, mode, feat_names):\n",
    "\n",
    "    new_feat_names = copy(feat_names)\n",
    "    ## get the indices of pooled samples\n",
    "    relevance_indices_dict = get_sample_indices_by_relevance(dfTrain)\n",
    "    query_relevance_indices_dict = get_sample_indices_by_relevance(dfTrain, \"qid\")\n",
    "    ## very time consuming\n",
    "    for dist in [\"jaccard_coef\", \"dice_dist\"]:\n",
    "        for name in [\"title\", \"description\"]:\n",
    "            for gram in [\"unigram\", \"bigram\", \"trigram\"]:\n",
    "                ## train\n",
    "                dist_stats_feat_by_relevance_train = generate_dist_stats_feat(dist, dfTrain[name+\"_\"+gram].values, dfTrain[\"id\"].values,\n",
    "                                                            dfTrain[name+\"_\"+gram].values, dfTrain[\"id\"].values,\n",
    "                                                            relevance_indices_dict)\n",
    "                dist_stats_feat_by_query_relevance_train = generate_dist_stats_feat(dist, dfTrain[name+\"_\"+gram].values, dfTrain[\"id\"].values,\n",
    "                                                                dfTrain[name+\"_\"+gram].values, dfTrain[\"id\"].values,\n",
    "                                                                query_relevance_indices_dict, dfTrain[\"qid\"].values)\n",
    "                with open(\"%s/train.%s_%s_%s_stats_feat_by_relevance.feat.pkl\" % (path, name, gram, dist), \"wb\") as f:\n",
    "                    cPickle.dump(dist_stats_feat_by_relevance_train, f, -1)\n",
    "                with open(\"%s/train.%s_%s_%s_stats_feat_by_query_relevance.feat.pkl\" % (path, name, gram, dist), \"wb\") as f:\n",
    "                    cPickle.dump(dist_stats_feat_by_query_relevance_train, f, -1)\n",
    "                ## test\n",
    "                dist_stats_feat_by_relevance_test = generate_dist_stats_feat(dist, dfTrain[name+\"_\"+gram].values, dfTrain[\"id\"].values,\n",
    "                                                            dfTest[name+\"_\"+gram].values, dfTest[\"id\"].values,\n",
    "                                                            relevance_indices_dict)\n",
    "                dist_stats_feat_by_query_relevance_test = generate_dist_stats_feat(dist, dfTrain[name+\"_\"+gram].values, dfTrain[\"id\"].values,\n",
    "                                                                dfTest[name+\"_\"+gram].values, dfTest[\"id\"].values,\n",
    "                                                                query_relevance_indices_dict, dfTest[\"qid\"].values)\n",
    "                with open(\"%s/%s.%s_%s_%s_stats_feat_by_relevance.feat.pkl\" % (path, mode, name, gram, dist), \"wb\") as f:\n",
    "                    cPickle.dump(dist_stats_feat_by_relevance_test, f, -1)\n",
    "                with open(\"%s/%s.%s_%s_%s_stats_feat_by_query_relevance.feat.pkl\" % (path, mode, name, gram, dist), \"wb\") as f:\n",
    "                    cPickle.dump(dist_stats_feat_by_query_relevance_test, f, -1)\n",
    "\n",
    "                ## update feat names\n",
    "                new_feat_names.append( \"%s_%s_%s_stats_feat_by_relevance\" % (name, gram, dist) )\n",
    "                new_feat_names.append( \"%s_%s_%s_stats_feat_by_query_relevance\" % (name, gram, dist) )\n",
    "\n",
    "    return new_feat_names"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T12:01:05.612153Z",
     "iopub.status.busy": "2021-07-10T12:01:05.611869Z",
     "iopub.status.idle": "2021-07-10T12:02:05.138999Z",
     "shell.execute_reply": "2021-07-10T12:02:05.138066Z",
     "shell.execute_reply.started": "2021-07-10T12:01:05.612127Z"
    }
   },
   "outputs": [],
   "source": [
    "extract_basic_distance_feat(df_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T12:02:05.140559Z",
     "iopub.status.busy": "2021-07-10T12:02:05.140266Z",
     "iopub.status.idle": "2021-07-10T12:02:05.188945Z",
     "shell.execute_reply": "2021-07-10T12:02:05.187863Z",
     "shell.execute_reply.started": "2021-07-10T12:02:05.140517Z"
    }
   },
   "outputs": [],
   "source": [
    "colst=df_train.select_dtypes(include=['float64','int64']).columns.to_list()\n",
    "colst.append('question1')\n",
    "colst.append('question2')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T12:02:05.190684Z",
     "iopub.status.busy": "2021-07-10T12:02:05.190367Z",
     "iopub.status.idle": "2021-07-10T12:02:05.197234Z",
     "shell.execute_reply": "2021-07-10T12:02:05.19636Z",
     "shell.execute_reply.started": "2021-07-10T12:02:05.190654Z"
    }
   },
   "outputs": [],
   "source": [
    "colst"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T12:02:05.198773Z",
     "iopub.status.busy": "2021-07-10T12:02:05.198456Z",
     "iopub.status.idle": "2021-07-10T12:02:05.943583Z",
     "shell.execute_reply": "2021-07-10T12:02:05.942643Z",
     "shell.execute_reply.started": "2021-07-10T12:02:05.198744Z"
    }
   },
   "outputs": [],
   "source": [
    "df_train=df_train[colst]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T12:02:05.945034Z",
     "iopub.status.busy": "2021-07-10T12:02:05.944726Z",
     "iopub.status.idle": "2021-07-10T12:02:06.145681Z",
     "shell.execute_reply": "2021-07-10T12:02:06.144755Z",
     "shell.execute_reply.started": "2021-07-10T12:02:05.945007Z"
    }
   },
   "outputs": [],
   "source": [
    "df_train"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T12:02:06.147118Z",
     "iopub.status.busy": "2021-07-10T12:02:06.146816Z",
     "iopub.status.idle": "2021-07-10T12:02:16.171414Z",
     "shell.execute_reply": "2021-07-10T12:02:16.170622Z",
     "shell.execute_reply.started": "2021-07-10T12:02:06.147092Z"
    }
   },
   "outputs": [],
   "source": [
    "df_train.to_csv('train_feat.csv',index=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T12:20:25.581089Z",
     "iopub.status.busy": "2021-07-10T12:20:25.58072Z",
     "iopub.status.idle": "2021-07-10T12:20:25.587526Z",
     "shell.execute_reply": "2021-07-10T12:20:25.586741Z",
     "shell.execute_reply.started": "2021-07-10T12:20:25.581059Z"
    }
   },
   "outputs": [],
   "source": [
    "df_test.shape[0]"
   ]
  },
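  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.3 Processing the test set\n",
    "The test set is too large to process in one pass, so it is split into several files that are processed one at a time to keep memory usage manageable."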
   ]
  },
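  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an alternative to slicing the frame by hand, pandas can stream a CSV in fixed-size chunks. The sketch below is only an illustration under stated assumptions: `process_chunk` is a hypothetical stand-in for the notebook's `extract_feat` and `extract_basic_distance_feat` calls, the `len_diff` column is invented for the demo, and the CSV is built in memory so the example runs without any files.\n",
    "\n",
    "```python\n",
    "import io\n",
    "import pandas as pd\n",
    "\n",
    "def process_chunk(chunk):\n",
    "    # hypothetical stand-in for the notebook's feature extraction functions\n",
    "    chunk = chunk.copy()\n",
    "    chunk['len_diff'] = (chunk['question1'].str.len() - chunk['question2'].str.len()).abs()\n",
    "    return chunk\n",
    "\n",
    "# build a tiny CSV in memory so the example is self-contained\n",
    "buf = io.StringIO()\n",
    "pd.DataFrame({'test_id': [0, 1, 2],\n",
    "              'question1': ['a b', 'x', 'hello'],\n",
    "              'question2': ['a b c', 'y z', 'hello']}).to_csv(buf, index=False)\n",
    "buf.seek(0)\n",
    "\n",
    "# with chunksize set, read_csv yields DataFrames of at most that many rows\n",
    "parts = [process_chunk(c) for c in pd.read_csv(buf, chunksize=2)]\n",
    "result = pd.concat(parts, ignore_index=True)\n",
    "print(result)\n",
    "```\n",
    "\n",
    "Only one chunk of raw text is held in memory while it is being processed. For a test set of this size, each processed chunk would more likely be written to disk than collected in a list."
   ]
  },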
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T12:16:56.330068Z",
     "iopub.status.busy": "2021-07-10T12:16:56.329614Z",
     "iopub.status.idle": "2021-07-10T12:17:13.349573Z",
     "shell.execute_reply": "2021-07-10T12:17:13.348598Z",
     "shell.execute_reply.started": "2021-07-10T12:16:56.330036Z"
    }
   },
   "outputs": [],
   "source": [
    "df_test[:400000].to_csv('test1.csv',index=False)\n",
    "df_test[400000:800000].to_csv('test2.csv',index=False)\n",
    "df_test[800000:1200000].to_csv('test3.csv',index=False)\n",
    "df_test[1200000:1600000].to_csv('test4.csv',index=False)\n",
    "df_test[1600000:2000000].to_csv('test5.csv',index=False)\n",
    "df_test[2000000:].to_csv('test6.csv',index=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T12:31:47.968165Z",
     "iopub.status.busy": "2021-07-10T12:31:47.967621Z",
     "iopub.status.idle": "2021-07-10T12:31:47.985506Z",
     "shell.execute_reply": "2021-07-10T12:31:47.984372Z",
     "shell.execute_reply.started": "2021-07-10T12:31:47.968116Z"
    }
   },
   "outputs": [],
   "source": [
    "tstlst=df_test.select_dtypes(include=['float64','int64']).columns.to_list()\n",
    "tstlst.append('question1')\n",
    "tstlst.append('question2')\n",
    "tstlst"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T12:20:28.776234Z",
     "iopub.status.busy": "2021-07-10T12:20:28.775688Z",
     "iopub.status.idle": "2021-07-10T12:28:44.272937Z",
     "shell.execute_reply": "2021-07-10T12:28:44.27069Z",
     "shell.execute_reply.started": "2021-07-10T12:20:28.776185Z"
    }
   },
   "outputs": [],
   "source": [
    "df_tmp=pd.read_csv('test1.csv')\n",
    "extract_feat(df_tmp)\n",
    "extract_basic_distance_feat(df_tmp)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T12:32:27.899845Z",
     "iopub.status.busy": "2021-07-10T12:32:27.899423Z",
     "iopub.status.idle": "2021-07-10T12:32:27.950247Z",
     "shell.execute_reply": "2021-07-10T12:32:27.949227Z",
     "shell.execute_reply.started": "2021-07-10T12:32:27.899808Z"
    }
   },
   "outputs": [],
   "source": [
    "tstlst=df_tmp.select_dtypes(include=['float64','int64']).columns.to_list()\n",
    "tstlst.append('question1')\n",
    "tstlst.append('question2')\n",
    "tstlst"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T12:32:49.899867Z",
     "iopub.status.busy": "2021-07-10T12:32:49.899493Z",
     "iopub.status.idle": "2021-07-10T12:32:59.025927Z",
     "shell.execute_reply": "2021-07-10T12:32:59.024911Z",
     "shell.execute_reply.started": "2021-07-10T12:32:49.899838Z"
    }
   },
   "outputs": [],
   "source": [
    "df_tmp=df_tmp[tstlst]\n",
    "df_tmp.to_csv('test_feat1.csv',index=False)\n",
    "del df_tmp"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T12:35:08.42882Z",
     "iopub.status.busy": "2021-07-10T12:35:08.428401Z",
     "iopub.status.idle": "2021-07-10T13:16:00.814715Z",
     "shell.execute_reply": "2021-07-10T13:16:00.813835Z",
     "shell.execute_reply.started": "2021-07-10T12:35:08.428785Z"
    }
   },
   "outputs": [],
   "source": [
    "for j in range(5):\n",
    "    i=j+2\n",
    "    df_tmp=pd.read_csv('test'+str(i)+'.csv')\n",
    "    extract_feat(df_tmp)\n",
    "    extract_basic_distance_feat(df_tmp)\n",
    "    tstlst=df_tmp.select_dtypes(include=['float64','int64']).columns.to_list()\n",
    "    tstlst.append('question1')\n",
    "    tstlst.append('question2')\n",
    "    df_tmp=df_tmp[tstlst]\n",
    "    df_tmp.to_csv('test_feat'+str(i)+'.csv',index=False)\n",
    "    del df_tmp"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T13:16:14.254149Z",
     "iopub.status.busy": "2021-07-10T13:16:14.253603Z",
     "iopub.status.idle": "2021-07-10T13:17:20.376156Z",
     "shell.execute_reply": "2021-07-10T13:17:20.375184Z",
     "shell.execute_reply.started": "2021-07-10T13:16:14.254115Z"
    }
   },
   "outputs": [],
   "source": [
    "del df_test\n",
    "# ignore_index=True gives the combined frame a fresh 0..n-1 index instead of repeating each part's index\n",
    "df_test=pd.concat([pd.read_csv('test_feat1.csv'),\n",
    "                   pd.read_csv('test_feat2.csv'),\n",
    "                   pd.read_csv('test_feat3.csv'),\n",
    "                   pd.read_csv('test_feat4.csv'),\n",
    "                   pd.read_csv('test_feat5.csv'),\n",
    "                   pd.read_csv('test_feat6.csv')], ignore_index=True)\n",
    "df_test.to_csv('test_feat.csv',index=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T13:17:50.479783Z",
     "iopub.status.busy": "2021-07-10T13:17:50.47937Z",
     "iopub.status.idle": "2021-07-10T13:17:51.656576Z",
     "shell.execute_reply": "2021-07-10T13:17:51.655489Z",
     "shell.execute_reply.started": "2021-07-10T13:17:50.479748Z"
    }
   },
   "outputs": [],
   "source": [
    "df_test"
   ]
  },
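  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.4 Correlation feature extraction\n",
    "Before building a model, we should check how much predictive power some of the features have. I start with the word-share feature from the baseline model."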
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T13:17:53.768529Z",
     "iopub.status.busy": "2021-07-10T13:17:53.768177Z",
     "iopub.status.idle": "2021-07-10T13:18:05.815723Z",
     "shell.execute_reply": "2021-07-10T13:18:05.814451Z",
     "shell.execute_reply.started": "2021-07-10T13:17:53.768498Z"
    }
   },
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "from nltk.corpus import stopwords\n",
    "\n",
    "stops = set(stopwords.words(\"english\"))\n",
    "\n",
    "def word_match_share(row):\n",
    "    q1words = {}\n",
    "    q2words = {}\n",
    "    for word in str(row['question1']).lower().split():\n",
    "        if word not in stops:\n",
    "            q1words[word] = 1\n",
    "    for word in str(row['question2']).lower().split():\n",
    "        if word not in stops:\n",
    "            q2words[word] = 1\n",
    "    if len(q1words) == 0 or len(q2words) == 0:\n",
    "        # The computer-generated chaff includes a few questions that are nothing but stopwords\n",
    "        return 0\n",
    "    shared_words_in_q1 = [w for w in q1words.keys() if w in q2words]\n",
    "    shared_words_in_q2 = [w for w in q2words.keys() if w in q1words]\n",
    "    R = (len(shared_words_in_q1) + len(shared_words_in_q2))/(len(q1words) + len(q2words))\n",
    "    return R\n",
    "\n",
    "plt.figure(figsize=(15, 5))\n",
    "train_word_match = df_train.apply(word_match_share, axis=1)\n",
    "plt.hist(train_word_match[df_train['is_duplicate'] == 0], bins=20, density=True, label='Not Duplicate')\n",
    "plt.hist(train_word_match[df_train['is_duplicate'] == 1], bins=20, density=True, alpha=0.7, label='Duplicate')\n",
    "plt.legend()\n",
    "plt.title('Label distribution over word_match_share', fontsize=15)\n",
    "plt.xlabel('word_match_share', fontsize=15)"
   ]
  },
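  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we can see that this feature has considerable predictive power, since it separates duplicate questions from non-duplicate ones well. Interestingly, it seems very good at identifying questions that are definitely different, but not so good at finding questions that are definitely duplicates.\n",
    "\n",
    "### 5.5 TF-IDF feature extraction\n",
    "\n",
    "I will now try to improve this feature using TF-IDF (term frequency-inverse document frequency). This means that we weigh terms by how uncommon they are, so we care more about rare words that appear in both questions than about common ones. This makes sense: for example, we care more about whether the word “exercise” appears in both questions than the word “and”, because uncommon words are more indicative of the content.\n",
    "\n",
    "If you were implementing this yourself, you might want to look at sklearn's TfidfVectorizer to compute the weights, but since I was too lazy to read the documentation, I will write a version in pure Python with a few changes that I believe should help the score."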
   ]
  },
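  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For comparison, here is a minimal sketch of the scikit-learn route mentioned above. It fits `TfidfVectorizer` on a made-up toy corpus and scores each question pair by cosine similarity. This is an assumed alternative for illustration only, not the weighting scheme implemented in the cells that follow.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "\n",
    "# toy question pairs (invented for this example)\n",
    "q1 = ['how do i start to exercise', 'what is the capital of france']\n",
    "q2 = ['how should i begin to exercise', 'how do i learn python']\n",
    "\n",
    "vec = TfidfVectorizer(stop_words='english')\n",
    "vec.fit(q1 + q2)  # learn the vocabulary and IDF weights from all questions\n",
    "a = vec.transform(q1)\n",
    "b = vec.transform(q2)\n",
    "\n",
    "# rows are L2-normalised by default, so the row-wise dot product is the cosine similarity\n",
    "sim = np.asarray(a.multiply(b).sum(axis=1)).ravel()\n",
    "print(sim)\n",
    "```\n",
    "\n",
    "The first pair shares the non-stopword “exercise” and gets a positive score, while the second pair shares no content words and scores zero."
   ]
  },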
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T13:18:09.919384Z",
     "iopub.status.busy": "2021-07-10T13:18:09.919046Z",
     "iopub.status.idle": "2021-07-10T13:18:13.712355Z",
     "shell.execute_reply": "2021-07-10T13:18:13.711367Z",
     "shell.execute_reply.started": "2021-07-10T13:18:09.919355Z"
    }
   },
   "outputs": [],
   "source": [
    "from collections import Counter\n",
    "\n",
    "# If a word appears only once, we ignore it completely (likely a typo)\n",
    "# Epsilon defines a smoothing constant, which makes the effect of extremely rare words smaller\n",
    "def get_weight(count, eps=10000, min_count=2):\n",
    "    if count < min_count:\n",
    "        return 0\n",
    "    else:\n",
    "        return 1 / (count + eps)\n",
    "\n",
    "# NOTE: this value is never passed to get_weight(), so the default eps=10000 is what is actually used below\n",
    "eps = 5000 \n",
    "words = (\" \".join(train_qs)).lower().split()\n",
    "counts = Counter(words)\n",
    "weights = {word: get_weight(count) for word, count in counts.items()}\n",
    "\n",
    "print('Most common words and weights: \\n')\n",
    "print(sorted(weights.items(), key=lambda x: x[1] if x[1] > 0 else 9999)[:10])\n",
    "print('\\nLeast common words and weights: ')\n",
    "(sorted(weights.items(), key=lambda x: x[1], reverse=True)[:10])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T13:18:13.715312Z",
     "iopub.status.busy": "2021-07-10T13:18:13.714879Z",
     "iopub.status.idle": "2021-07-10T13:18:13.724575Z",
     "shell.execute_reply": "2021-07-10T13:18:13.723578Z",
     "shell.execute_reply.started": "2021-07-10T13:18:13.715267Z"
    }
   },
   "outputs": [],
   "source": [
    "def tfidf_word_match_share(row):\n",
    "    q1words = {}\n",
    "    q2words = {}\n",
    "    for word in str(row['question1']).lower().split():\n",
    "        if word not in stops:\n",
    "            q1words[word] = 1\n",
    "    for word in str(row['question2']).lower().split():\n",
    "        if word not in stops:\n",
    "            q2words[word] = 1\n",
    "    if len(q1words) == 0 or len(q2words) == 0:\n",
    "        # The computer-generated chaff includes a few questions that are nothing but stopwords\n",
    "        return 0\n",
    "    \n",
    "    shared_weights = [weights.get(w, 0) for w in q1words.keys() if w in q2words] + [weights.get(w, 0) for w in q2words.keys() if w in q1words]\n",
    "    total_weights = [weights.get(w, 0) for w in q1words] + [weights.get(w, 0) for w in q2words]\n",
    "    \n",
    "    R = np.sum(shared_weights) / np.sum(total_weights)\n",
    "    return R"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T13:18:13.727099Z",
     "iopub.status.busy": "2021-07-10T13:18:13.726488Z",
     "iopub.status.idle": "2021-07-10T13:18:37.763396Z",
     "shell.execute_reply": "2021-07-10T13:18:37.762661Z",
     "shell.execute_reply.started": "2021-07-10T13:18:13.727054Z"
    }
   },
   "outputs": [],
   "source": [
    "plt.figure(figsize=(15, 5))\n",
    "tfidf_train_word_match = df_train.apply(tfidf_word_match_share, axis=1)\n",
    "plt.hist(tfidf_train_word_match[df_train['is_duplicate'] == 0].fillna(0), bins=20, density=True, label='Not Duplicate')\n",
    "plt.hist(tfidf_train_word_match[df_train['is_duplicate'] == 1].fillna(0), bins=20, density=True, alpha=0.7, label='Duplicate')\n",
    "plt.legend()\n",
    "plt.title('Label distribution over tfidf_word_match_share', fontsize=15)\n",
    "plt.xlabel('tfidf_word_match_share', fontsize=15)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T13:19:04.488104Z",
     "iopub.status.busy": "2021-07-10T13:19:04.487537Z",
     "iopub.status.idle": "2021-07-10T13:19:04.785345Z",
     "shell.execute_reply": "2021-07-10T13:19:04.784212Z",
     "shell.execute_reply.started": "2021-07-10T13:19:04.488069Z"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.metrics import roc_auc_score\n",
    "print('Original AUC:', roc_auc_score(df_train['is_duplicate'], train_word_match))\n",
    "print('   TFIDF AUC:', roc_auc_score(df_train['is_duplicate'], tfidf_train_word_match.fillna(0)))"
   ]
  },
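  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So it looks like our TF-IDF feature actually got worse in terms of overall AUC, which is a bit disappointing. (I am using the AUC metric because it is unaffected by scaling and similar transformations, which makes it a good metric for testing the predictive power of individual features.)\n",
    "\n",
    "However, I still think this feature should provide some extra information that the original feature does not. Our next step is to combine these features and use them to make predictions."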
   ]
  },
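  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The scale-invariance claim can be checked on a tiny example. AUC depends only on how the scores rank positives against negatives, so any strictly increasing transformation of a feature leaves it unchanged. The labels and scores below are invented for this illustration.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.metrics import roc_auc_score\n",
    "\n",
    "# made-up labels and feature values\n",
    "y = np.array([0, 0, 1, 1, 0, 1])\n",
    "score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])\n",
    "\n",
    "auc_raw = roc_auc_score(y, score)\n",
    "auc_scaled = roc_auc_score(y, 100 * score + 5)  # linear rescaling\n",
    "auc_log = roc_auc_score(y, np.log(score))       # another strictly increasing map\n",
    "print(auc_raw, auc_scaled, auc_log)\n",
    "```\n",
    "\n",
    "All three values agree: 8 of the 9 positive/negative pairs are ranked correctly, so the AUC is 8/9 in each case. LogLoss, by contrast, would change under these transformations because it depends on the values themselves."
   ]
  },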
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T13:19:13.573603Z",
     "iopub.status.busy": "2021-07-10T13:19:13.573206Z",
     "iopub.status.idle": "2021-07-10T13:19:13.582915Z",
     "shell.execute_reply": "2021-07-10T13:19:13.582169Z",
     "shell.execute_reply.started": "2021-07-10T13:19:13.573563Z"
    }
   },
   "outputs": [],
   "source": [
    "import re, string, six\n",
    "\n",
    "from nltk.corpus import stopwords\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "# f-string: the expression inside the braces is evaluated and inserted into the pattern\n",
    "re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')\n",
    "\n",
    "def tokenize(s): \n",
    "    # raw string, so the backslash reaches the regex engine as a group reference instead of a Python escape\n",
    "    return re_tok.sub(r' \\1 ', s).split()\n",
    "\n",
    "def clean_text(s):\n",
    "    try:\n",
    "        return re.sub(r'[^A-Za-z0-9,?\"\\'. ]+', '', s).encode('utf-8').decode('utf-8').lower()\n",
    "    except:\n",
    "        return \"\"\n",
    "    \n",
    "def word_count_diff(row):\n",
    "    try:\n",
    "        q1words = len(list(filter(lambda x: x.lower() not in stops, tokenize(row['question1']))))\n",
    "        q2words = len(list(filter(lambda x: x.lower() not in stops, tokenize(row['question2']))))\n",
    "        return abs(q1words - q2words)\n",
    "    except:\n",
    "        return 50"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T13:19:21.437943Z",
     "iopub.status.busy": "2021-07-10T13:19:21.437414Z",
     "iopub.status.idle": "2021-07-10T13:19:39.934121Z",
     "shell.execute_reply": "2021-07-10T13:19:39.933013Z",
     "shell.execute_reply.started": "2021-07-10T13:19:21.437909Z"
    }
   },
   "outputs": [],
   "source": [
    "plt.figure(figsize=(15, 5))\n",
    "train_word_count_diff = df_train.apply(word_count_diff, axis=1)\n",
    "plt.hist(train_word_count_diff[df_train['is_duplicate'] == 0].fillna(0), bins=20, density=True, label='Not Duplicate')\n",
    "plt.hist(train_word_count_diff[df_train['is_duplicate'] == 1].fillna(0), bins=20, density=True, alpha=0.7, label='Duplicate')\n",
    "plt.legend()\n",
    "plt.title('Label distribution over word_count_diff', fontsize=15)\n",
    "plt.xlabel('word_count_diff', fontsize=15)"
   ]
  },
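  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.6 Rebalancing the data\n",
    "\n",
    "Before doing that, however, I want to rebalance the data the classifier receives, because about 37% of our training data is in the positive class while only about 17% of the test data is. By rebalancing the training set so that it has a 17% positive rate, we can ensure that the probabilities output by the classifier better match the leaderboard data, which should give a better score (since LogLoss looks at the probabilities themselves and not just the ordering of the predictions, as AUC does)."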
   ]
  },
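  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One possible way to reach a target positive rate is to oversample the negative class. The sketch below is an assumption about how this could be done, not necessarily the approach this notebook takes: `oversample_negatives` is a hypothetical helper, and the toy frame simply mimics the 37% positive rate quoted above.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "def oversample_negatives(df, label_col, target_pos_rate, seed=0):\n",
    "    # hypothetical helper: duplicate randomly chosen negative rows until\n",
    "    # positives make up roughly target_pos_rate of the data\n",
    "    pos = df[df[label_col] == 1]\n",
    "    neg = df[df[label_col] == 0]\n",
    "    # solve n_pos / (n_pos + n_neg) = target_pos_rate for n_neg\n",
    "    n_neg_needed = int(round(len(pos) * (1 - target_pos_rate) / target_pos_rate))\n",
    "    extra = max(n_neg_needed - len(neg), 0)\n",
    "    extra_neg = neg.sample(n=extra, replace=True, random_state=seed)\n",
    "    return pd.concat([pos, neg, extra_neg], ignore_index=True)\n",
    "\n",
    "# toy data with a 37% positive rate\n",
    "toy = pd.DataFrame({'is_duplicate': [1] * 37 + [0] * 63})\n",
    "balanced = oversample_negatives(toy, 'is_duplicate', 0.17)\n",
    "print(len(balanced), balanced['is_duplicate'].mean())\n",
    "```\n",
    "\n",
    "With 37 positives, about 181 negatives are needed for a 17% positive rate, so 118 duplicated negative rows are added to the original 63. Oversampling keeps every original row, at the cost of a larger training set."
   ]
  },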
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T13:45:58.665778Z",
     "iopub.status.busy": "2021-07-10T13:45:58.665405Z",
     "iopub.status.idle": "2021-07-10T13:45:58.672307Z",
     "shell.execute_reply": "2021-07-10T13:45:58.671556Z",
     "shell.execute_reply.started": "2021-07-10T13:45:58.665749Z"
    }
   },
   "outputs": [],
   "source": [
    "df_train.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T13:22:01.58629Z",
     "iopub.status.busy": "2021-07-10T13:22:01.585889Z",
     "iopub.status.idle": "2021-07-10T13:22:01.678509Z",
     "shell.execute_reply": "2021-07-10T13:22:01.67749Z",
     "shell.execute_reply.started": "2021-07-10T13:22:01.586255Z"
    }
   },
   "outputs": [],
   "source": [
    "# First we create our training and testing data\n",
    "x_train = pd.DataFrame()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T13:22:01.58629Z",
     "iopub.status.busy": "2021-07-10T13:22:01.585889Z",
     "iopub.status.idle": "2021-07-10T13:22:01.678509Z",
     "shell.execute_reply": "2021-07-10T13:22:01.67749Z",
     "shell.execute_reply.started": "2021-07-10T13:22:01.586255Z"
    }
   },
   "outputs": [],
   "source": [
    "x_test = pd.DataFrame()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T13:46:29.169148Z",
     "iopub.status.busy": "2021-07-10T13:46:29.168792Z",
     "iopub.status.idle": "2021-07-10T13:46:29.203528Z",
     "shell.execute_reply": "2021-07-10T13:46:29.202761Z",
     "shell.execute_reply.started": "2021-07-10T13:46:29.169119Z"
    }
   },
   "outputs": [],
   "source": [
    "x_train=df_train.drop(columns=['id','question1','question2','qid1','qid2','is_duplicate'])\n",
    "y_train=df_train['is_duplicate']\n",
    "x_train['word_match'] = train_word_match\n",
    "x_train['tfidf_word_match'] = tfidf_train_word_match\n",
    "x_train['word_count_diff'] = train_word_count_diff"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T14:10:58.062736Z",
     "iopub.status.busy": "2021-07-10T14:10:58.062224Z",
     "iopub.status.idle": "2021-07-10T14:16:12.864601Z",
     "shell.execute_reply": "2021-07-10T14:16:12.863472Z",
     "shell.execute_reply.started": "2021-07-10T14:10:58.062696Z"
    }
   },
   "outputs": [],
   "source": [
    "x_test=df_test.drop(columns=['test_id','question1','question2'])\n",
    "x_test['word_match'] = df_test.apply(word_match_share, axis=1)\n",
    "x_test['tfidf_word_match'] = df_test.apply(tfidf_word_match_share, axis=1)\n",
    "x_test['word_count_diff'] = df_test.apply(word_count_diff, axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T15:27:34.039731Z",
     "iopub.status.busy": "2021-07-10T15:27:34.039325Z",
     "iopub.status.idle": "2021-07-10T15:27:34.047377Z",
     "shell.execute_reply": "2021-07-10T15:27:34.046385Z",
     "shell.execute_reply.started": "2021-07-10T15:27:34.039698Z"
    }
   },
   "outputs": [],
   "source": [
    "x_train.columns.to_list()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T14:16:59.154118Z",
     "iopub.status.busy": "2021-07-10T14:16:59.153576Z",
     "iopub.status.idle": "2021-07-10T14:16:59.159847Z",
     "shell.execute_reply": "2021-07-10T14:16:59.158832Z",
     "shell.execute_reply.started": "2021-07-10T14:16:59.154086Z"
    }
   },
   "outputs": [],
   "source": [
    "print(x_train.shape,x_test.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T13:48:52.145685Z",
     "iopub.status.busy": "2021-07-10T13:48:52.145153Z",
     "iopub.status.idle": "2021-07-10T13:49:01.256566Z",
     "shell.execute_reply": "2021-07-10T13:49:01.255476Z",
     "shell.execute_reply.started": "2021-07-10T13:48:52.145651Z"
    }
   },
   "outputs": [],
   "source": [
    "x_train.to_csv(\"train_final.csv\", index=False)\n",
    "y_train.to_csv(\"ytrain_final.csv\", index=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T14:17:04.167189Z",
     "iopub.status.busy": "2021-07-10T14:17:04.166836Z",
     "iopub.status.idle": "2021-07-10T14:17:51.140662Z",
     "shell.execute_reply": "2021-07-10T14:17:51.139745Z",
     "shell.execute_reply.started": "2021-07-10T14:17:04.167158Z"
    }
   },
   "outputs": [],
   "source": [
    "x_test.to_csv(\"test_final.csv\", index=False)\n",
    "\n",
    "#y_train = df_train['is_duplicate'].values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T13:39:04.201095Z",
     "iopub.status.busy": "2021-07-10T13:39:04.200437Z",
     "iopub.status.idle": "2021-07-10T13:39:05.271Z",
     "shell.execute_reply": "2021-07-10T13:39:05.270203Z",
     "shell.execute_reply.started": "2021-07-10T13:39:04.201041Z"
    }
   },
   "outputs": [],
   "source": [
    "x_train=pd.read_csv(\"train_final.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T13:49:21.370512Z",
     "iopub.status.busy": "2021-07-10T13:49:21.370124Z",
     "iopub.status.idle": "2021-07-10T13:49:22.04975Z",
     "shell.execute_reply": "2021-07-10T13:49:22.048672Z",
     "shell.execute_reply.started": "2021-07-10T13:49:21.370479Z"
    }
   },
   "outputs": [],
   "source": [
    "pos_train = x_train[y_train == 1]\n",
    "neg_train = x_train[y_train == 0]\n",
    "\n",
    "# Now we oversample the negative class\n",
    "# There is likely a much more elegant way to do this...\n",
    "p = 0.165\n",
    "scale = ((len(pos_train) / (len(pos_train) + len(neg_train))) / p) - 1\n",
    "while scale > 1:\n",
    "    neg_train = pd.concat([neg_train, neg_train])\n",
    "    scale -=1\n",
    "neg_train = pd.concat([neg_train, neg_train[:int(scale * len(neg_train))]])\n",
    "print(len(pos_train) / (len(pos_train) + len(neg_train)))\n",
    "\n",
    "x_train = pd.concat([pos_train, neg_train])\n",
    "y_train = (np.zeros(len(pos_train)) + 1).tolist() + np.zeros(len(neg_train)).tolist()\n",
    "del pos_train, neg_train"
   ]
  },
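  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a sanity check on the oversampling cell above, the positive share it produces can be worked out by hand. This is only a sketch: the symbols $r$ (positive share before rebalancing, about 0.37 according to section 5.6), $t$ (target positive share) and $m$ (factor by which the negative rows are multiplied) are introduced here for illustration and do not appear in the notebook's code.\n",
    "\n",
    "With $P$ positive and $N$ negative rows, $r = P/(P+N)$, and multiplying the negatives by $m$ gives a positive share of\n",
    "\n",
    "$$\\frac{P}{P + mN} = \\frac{r}{r + m(1-r)}, \\qquad\\text{so a target share } t \\text{ requires}\\qquad m = \\frac{r(1-t)}{t(1-r)}.$$\n",
    "\n",
    "For $r \\approx 0.37$ and $t = 0.165$ this gives $m \\approx 2.97$. The cell instead starts from `scale = r/p - 1` $\\approx 1.24$, doubles `neg_train` once, and then appends a further `scale - 1` $\\approx 0.24$ of the doubled frame, so that\n",
    "\n",
    "$$m = 2\\left(\\frac{r}{p} - 1\\right) \\approx 2.48, \\qquad \\frac{r}{r + m(1-r)} \\approx \\frac{0.37}{0.37 + 2.48 \\times 0.63} \\approx 0.19.$$\n",
    "\n",
    "Under these assumptions (and ignoring integer truncation), the share printed by the cell should be close to 0.19 rather than exactly `p = 0.165`."
   ]
  },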
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T13:49:32.362276Z",
     "iopub.status.busy": "2021-07-10T13:49:32.361706Z",
     "iopub.status.idle": "2021-07-10T13:49:33.171481Z",
     "shell.execute_reply": "2021-07-10T13:49:33.170603Z",
     "shell.execute_reply.started": "2021-07-10T13:49:32.362227Z"
    }
   },
   "outputs": [],
   "source": [
    "# Finally, we split some of the data off for validation\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "x_train, x_valid, y_train, y_valid = train_test_split(x_train, y_train, test_size=0.2, random_state=4242)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6 Modeling with LightGBM"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T13:49:36.38153Z",
     "iopub.status.busy": "2021-07-10T13:49:36.380824Z",
     "iopub.status.idle": "2021-07-10T13:49:36.387654Z",
     "shell.execute_reply": "2021-07-10T13:49:36.386675Z",
     "shell.execute_reply.started": "2021-07-10T13:49:36.381473Z"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.metrics import log_loss\n",
    "import lightgbm as lgb\n",
    "\n",
    "# create dataset for lightgbm\n",
    "lgb_train = lgb.Dataset(x_train, y_train)\n",
    "lgb_eval = lgb.Dataset(x_valid, y_valid, reference=lgb_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T13:49:38.169033Z",
     "iopub.status.busy": "2021-07-10T13:49:38.168448Z",
     "iopub.status.idle": "2021-07-10T13:49:40.098962Z",
     "shell.execute_reply": "2021-07-10T13:49:40.097619Z",
     "shell.execute_reply.started": "2021-07-10T13:49:38.168981Z"
    }
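   },
   "outputs": [],
   "source": [
    "# specify your configurations as a dict\n",
    "params = {'boosting_type': 'gbdt','objective': 'binary','metric': 'binary_logloss',\n",
    "          'num_leaves': 37,'learning_rate': 0.89,'feature_fraction': 0.9,\n",
    "          'bagging_fraction': 0.8,'bagging_freq': 5,'verbose': 0}\n",
    "# early stopping is passed as a callback: the early_stopping_rounds keyword\n",
    "# was removed from lgb.train in LightGBM 4.0, while the callback works in both 3.x and 4.x\n",
    "gbm = lgb.train(params,lgb_train,num_boost_round=20,valid_sets=[lgb_eval],\n",
    "                callbacks=[lgb.early_stopping(stopping_rounds=5)])"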
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T13:49:47.697902Z",
     "iopub.status.busy": "2021-07-10T13:49:47.697509Z",
     "iopub.status.idle": "2021-07-10T13:49:47.766772Z",
     "shell.execute_reply": "2021-07-10T13:49:47.76575Z",
     "shell.execute_reply.started": "2021-07-10T13:49:47.69787Z"
    }
   },
   "outputs": [],
   "source": [
    "y_pred = gbm.predict(x_valid, num_iteration=gbm.best_iteration)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T13:49:51.030324Z",
     "iopub.status.busy": "2021-07-10T13:49:51.029908Z",
     "iopub.status.idle": "2021-07-10T13:49:51.438164Z",
     "shell.execute_reply": "2021-07-10T13:49:51.437374Z",
     "shell.execute_reply.started": "2021-07-10T13:49:51.030289Z"
    }
   },
   "outputs": [],
   "source": [
    "log_loss(y_valid,y_pred)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Score (log loss on the validation split): 0.34765581937575907"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T14:17:56.441589Z",
     "iopub.status.busy": "2021-07-10T14:17:56.440991Z",
     "iopub.status.idle": "2021-07-10T14:17:58.169464Z",
     "shell.execute_reply": "2021-07-10T14:17:58.168631Z",
     "shell.execute_reply.started": "2021-07-10T14:17:56.44152Z"
    }
   },
   "outputs": [],
   "source": [
    "y_result=gbm.predict(x_test, num_iteration=gbm.best_iteration)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T14:23:05.316738Z",
     "iopub.status.busy": "2021-07-10T14:23:05.316195Z",
     "iopub.status.idle": "2021-07-10T14:23:05.325236Z",
     "shell.execute_reply": "2021-07-10T14:23:05.323937Z",
     "shell.execute_reply.started": "2021-07-10T14:23:05.31669Z"
    }
   },
   "outputs": [],
   "source": [
    "df_test2=pd.DataFrame()\n",
    "df_test2['test_id']=df_test1['test_id']\n",
    "df_test2['is_duplicate']=0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T14:23:57.433938Z",
     "iopub.status.busy": "2021-07-10T14:23:57.433074Z",
     "iopub.status.idle": "2021-07-10T14:23:57.895031Z",
     "shell.execute_reply": "2021-07-10T14:23:57.894012Z",
     "shell.execute_reply.started": "2021-07-10T14:23:57.433868Z"
    }
   },
   "outputs": [],
   "source": [
    "sub = pd.DataFrame()\n",
    "sub['test_id'] = df_test['test_id']\n",
    "sub['is_duplicate'] = y_result\n",
    "sub=pd.concat([sub,df_test2])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-07-10T14:37:32.573512Z",
     "iopub.status.busy": "2021-07-10T14:37:32.573021Z",
     "iopub.status.idle": "2021-07-10T14:37:45.985291Z",
     "shell.execute_reply": "2021-07-10T14:37:45.983934Z",
     "shell.execute_reply.started": "2021-07-10T14:37:32.573473Z"
    }
   },
   "outputs": [],
   "source": [
    "sub.to_csv('submission.csv.zip', index=False)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
